'Living' Conclusion Gathering Space¶
Conclusions Data Merge¶
- There seems to be no pattern in the null values of the target column. This suggests we can safely drop these rows
County
- we think county 12 is odd (unknown location) and introduces NA values, since it is not covered by the weather data
- maybe drop county 12, but this may lead to other problems
- this county is named 'unknown'
- the forums say counties 16 and 0 may also be odd
data_block_id
- we could reduce the NaN and NaT values by excluding data_block_id 0 and 1 (the very beginning of the data set)
Modelling / Time Series
- We are unsure about the modelling approach (is a dedicated time-series model needed, e.g. ARMA?). The forums suggest tree-based models, e.g. XGBoost
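The forum-suggested tree-model direction would turn the series into a supervised table, typically via lagged targets. A minimal sketch on a toy frame (the helper name `add_lag_features` and the lag choice are our assumptions, not from the competition):

```python
import pandas as pd

def add_lag_features(df, group_cols, target_col, lags):
    """Add per-unit lagged copies of the target as candidate features."""
    df = df.sort_values("datetime").copy()
    for lag in lags:
        df[f"{target_col}_lag{lag}"] = df.groupby(group_cols)[target_col].shift(lag)
    return df

# Toy frame with two prediction units and three hourly steps each
toy = pd.DataFrame({
    "prediction_unit_id": [0, 0, 0, 1, 1, 1],
    "datetime": list(pd.to_datetime([
        "2021-09-01 00:00", "2021-09-01 01:00", "2021-09-01 02:00"])) * 2,
    "target": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
toy = add_lag_features(toy, ["prediction_unit_id"], "target", lags=[1])
```

Each unit's first row gets NaN for the lag, so any model fit would start one step in.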
Conclusions EDA¶
- consumption is noticeably affected by the winter holidays
- consumption seems to be growing over time
- county 0 dominates; Tallinn is located there
- the temperature-to-production ratio seems to have changed in the last year
- product_type 2 attracts producers with small installed capacity and thus low production
- product_type 3 attracts the opposite cluster: producers with large installed_capacity
- surface solar radiation seems to correlate more strongly with the target than direct_solar
- there seems to be a 'split' around a daily mean target of 6000 (unit?)
- we expected more businesses among the top producers
Preparing Data¶
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
client = pd.read_csv('../data/client.csv')
client.head()
| product_type | county | eic_count | installed_capacity | is_business | date | data_block_id | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 108 | 952.89 | 0 | 2021-09-01 | 2 |
| 1 | 2 | 0 | 17 | 166.40 | 0 | 2021-09-01 | 2 |
| 2 | 3 | 0 | 688 | 7207.88 | 0 | 2021-09-01 | 2 |
| 3 | 0 | 0 | 5 | 400.00 | 1 | 2021-09-01 | 2 |
| 4 | 1 | 0 | 43 | 1411.00 | 1 | 2021-09-01 | 2 |
Client Data¶
# Datatype conversion
client.date = pd.to_datetime(client.date)
client.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41919 entries, 0 to 41918
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   product_type        41919 non-null  int64
 1   county              41919 non-null  int64
 2   eic_count           41919 non-null  int64
 3   installed_capacity  41919 non-null  float64
 4   is_business         41919 non-null  int64
 5   date                41919 non-null  datetime64[ns]
 6   data_block_id       41919 non-null  int64
dtypes: datetime64[ns](1), float64(1), int64(5)
memory usage: 2.2 MB
# categoricals?
display(client.product_type.unique())
display(client.is_business.unique())
display(client.county.unique())
array([1, 2, 3, 0])
array([0, 1])
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
sns.scatterplot(x = client.date, y = client.data_block_id)
<Axes: xlabel='date', ylabel='data_block_id'>
display(min(client.date))
display(max(client.date))
display(client.data_block_id.nunique())
Timestamp('2021-09-01 00:00:00')
Timestamp('2023-05-29 00:00:00')
636
First day is '2021-09-01 00:00:00', last day is '2023-05-29 00:00:00'. There are 636 unique days, and data_block_id corresponds one-to-one to the date.
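The claim that data_block_id corresponds to date can be checked directly. A small helper, shown on a toy frame (on the real frame one would call `is_one_to_one(client, 'date', 'data_block_id')`; the helper name is ours):

```python
import pandas as pd

def is_one_to_one(df, a, b):
    """True if columns a and b map one-to-one onto each other."""
    return bool(df.groupby(a)[b].nunique().max() == 1
                and df.groupby(b)[a].nunique().max() == 1)

# Toy frame shaped like client's (date, data_block_id) pair
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-01"] * 2 + ["2021-09-02"] * 2),
    "data_block_id": [2, 2, 3, 3],
})
```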
Electricity Prices Data¶
electricity_prices = pd.read_csv('../data/electricity_prices.csv')
electricity_prices.head()
| forecast_date | euros_per_mwh | origin_date | data_block_id | |
|---|---|---|---|---|
| 0 | 2021-09-01 00:00:00 | 92.51 | 2021-08-31 00:00:00 | 1 |
| 1 | 2021-09-01 01:00:00 | 88.90 | 2021-08-31 01:00:00 | 1 |
| 2 | 2021-09-01 02:00:00 | 87.35 | 2021-08-31 02:00:00 | 1 |
| 3 | 2021-09-01 03:00:00 | 86.88 | 2021-08-31 03:00:00 | 1 |
| 4 | 2021-09-01 04:00:00 | 88.43 | 2021-08-31 04:00:00 | 1 |
electricity_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15286 entries, 0 to 15285
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   forecast_date  15286 non-null  object
 1   euros_per_mwh  15286 non-null  float64
 2   origin_date    15286 non-null  object
 3   data_block_id  15286 non-null  int64
dtypes: float64(1), int64(1), object(2)
memory usage: 477.8+ KB
electricity_prices.forecast_date = pd.to_datetime(electricity_prices.forecast_date)
electricity_prices.origin_date = pd.to_datetime(electricity_prices.origin_date)
electricity_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15286 entries, 0 to 15285
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   forecast_date  15286 non-null  datetime64[ns]
 1   euros_per_mwh  15286 non-null  float64
 2   origin_date    15286 non-null  datetime64[ns]
 3   data_block_id  15286 non-null  int64
dtypes: datetime64[ns](2), float64(1), int64(1)
memory usage: 477.8 KB
electricity_prices.nunique()
forecast_date    15286
euros_per_mwh    11025
origin_date      15286
data_block_id      637
dtype: int64
electricity_prices.forecast_date.unique()
<DatetimeArray>
['2021-09-01 00:00:00', '2021-09-01 01:00:00', '2021-09-01 02:00:00',
 '2021-09-01 03:00:00', '2021-09-01 04:00:00', '2021-09-01 05:00:00',
 '2021-09-01 06:00:00', '2021-09-01 07:00:00', '2021-09-01 08:00:00',
 '2021-09-01 09:00:00',
 ...
 '2023-05-30 14:00:00', '2023-05-30 15:00:00', '2023-05-30 16:00:00',
 '2023-05-30 17:00:00', '2023-05-30 18:00:00', '2023-05-30 19:00:00',
 '2023-05-30 20:00:00', '2023-05-30 21:00:00', '2023-05-30 22:00:00',
 '2023-05-30 23:00:00']
Length: 15286, dtype: datetime64[ns]
Electricity price forecasts are available for each hour of the day, covering 637 days (one day more than the client data). A full hourly grid would give 637 * 24 = 15288 rows, but there are only 15286 unique datetimes, so two hours are missing somewhere.
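The two missing hours can be located by differencing the observed timestamps against a full hourly grid. A toy example with one hour deliberately removed (on the real frame, the grid would span `forecast_date.min()` to `forecast_date.max()`):

```python
import pandas as pd

# Full hourly grid for a single toy day
full = pd.date_range("2021-09-01 00:00", "2021-09-01 23:00", freq="h")
observed = full.delete(3)            # pretend 03:00 never happened
missing = full.difference(observed)  # timestamps in the grid but not observed
```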
Forecast Weather Data¶
forecast_weather = pd.read_csv('../data/forecast_weather.csv')
forecast_weather.head()
| latitude | longitude | origin_datetime | hours_ahead | temperature | dewpoint | cloudcover_high | cloudcover_low | cloudcover_mid | cloudcover_total | 10_metre_u_wind_component | 10_metre_v_wind_component | data_block_id | forecast_datetime | direct_solar_radiation | surface_solar_radiation_downwards | snowfall | total_precipitation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57.6 | 21.7 | 2021-09-01 00:00:00+00:00 | 1 | 15.655786 | 11.553613 | 0.904816 | 0.019714 | 0.000000 | 0.905899 | -0.411328 | -9.106137 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 57.6 | 22.2 | 2021-09-01 00:00:00+00:00 | 1 | 13.003931 | 10.689844 | 0.886322 | 0.004456 | 0.000000 | 0.886658 | 0.206347 | -5.355405 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 57.6 | 22.7 | 2021-09-01 00:00:00+00:00 | 1 | 14.206567 | 11.671777 | 0.729034 | 0.005615 | 0.000000 | 0.730499 | 1.451587 | -7.417905 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 57.6 | 23.2 | 2021-09-01 00:00:00+00:00 | 1 | 14.844507 | 12.264917 | 0.336304 | 0.074341 | 0.000626 | 0.385468 | 1.090869 | -9.163999 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 57.6 | 23.7 | 2021-09-01 00:00:00+00:00 | 1 | 15.293848 | 12.458887 | 0.102875 | 0.088074 | 0.000015 | 0.176590 | 1.268481 | -8.975766 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
forecast_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424512 entries, 0 to 3424511
Data columns (total 18 columns):
 #   Column                             Dtype
---  ------                             -----
 0   latitude                           float64
 1   longitude                          float64
 2   origin_datetime                    object
 3   hours_ahead                        int64
 4   temperature                        float64
 5   dewpoint                           float64
 6   cloudcover_high                    float64
 7   cloudcover_low                     float64
 8   cloudcover_mid                     float64
 9   cloudcover_total                   float64
 10  10_metre_u_wind_component          float64
 11  10_metre_v_wind_component          float64
 12  data_block_id                      int64
 13  forecast_datetime                  object
 14  direct_solar_radiation             float64
 15  surface_solar_radiation_downwards  float64
 16  snowfall                           float64
 17  total_precipitation                float64
dtypes: float64(14), int64(2), object(2)
memory usage: 470.3+ MB
forecast_weather.origin_datetime = pd.to_datetime(forecast_weather.origin_datetime)
forecast_weather.forecast_datetime = pd.to_datetime(forecast_weather.forecast_datetime)
forecast_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3424512 entries, 0 to 3424511
Data columns (total 18 columns):
 #   Column                             Dtype
---  ------                             -----
 0   latitude                           float64
 1   longitude                          float64
 2   origin_datetime                    datetime64[ns, UTC]
 3   hours_ahead                        int64
 4   temperature                        float64
 5   dewpoint                           float64
 6   cloudcover_high                    float64
 7   cloudcover_low                     float64
 8   cloudcover_mid                     float64
 9   cloudcover_total                   float64
 10  10_metre_u_wind_component          float64
 11  10_metre_v_wind_component          float64
 12  data_block_id                      int64
 13  forecast_datetime                  datetime64[ns, UTC]
 14  direct_solar_radiation             float64
 15  surface_solar_radiation_downwards  float64
 16  snowfall                           float64
 17  total_precipitation                float64
dtypes: datetime64[ns, UTC](2), float64(14), int64(2)
memory usage: 470.3 MB
forecast_weather.nunique()
latitude                                   8
longitude                                 14
origin_datetime                          637
hours_ahead                               48
temperature                           400895
dewpoint                              439461
cloudcover_high                       212178
cloudcover_low                        408609
cloudcover_mid                        308004
cloudcover_total                      698385
10_metre_u_wind_component            3166997
10_metre_v_wind_component            3142161
data_block_id                            637
forecast_datetime                      15312
direct_solar_radiation               1199360
surface_solar_radiation_downwards    1496773
snowfall                              256201
total_precipitation                   540661
dtype: int64
forecast_weather.groupby(['latitude', 'longitude']).nunique()
| origin_datetime | hours_ahead | temperature | dewpoint | cloudcover_high | cloudcover_low | cloudcover_mid | cloudcover_total | 10_metre_u_wind_component | 10_metre_v_wind_component | data_block_id | forecast_datetime | direct_solar_radiation | surface_solar_radiation_downwards | snowfall | total_precipitation | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| latitude | longitude | ||||||||||||||||
| 57.6 | 21.7 | 637 | 48 | 27706 | 28047 | 10902 | 14794 | 13125 | 16896 | 30563 | 30551 | 637 | 15312 | 14747 | 16068 | 4250 | 9540 |
| 22.2 | 637 | 48 | 28285 | 28292 | 11118 | 15006 | 13505 | 17130 | 30548 | 30550 | 637 | 15312 | 14555 | 16085 | 4389 | 9353 | |
| 22.7 | 637 | 48 | 27799 | 28120 | 11000 | 14869 | 13640 | 17178 | 30554 | 30556 | 637 | 15312 | 14561 | 16074 | 4327 | 9261 | |
| 23.2 | 637 | 48 | 27587 | 28120 | 10821 | 14289 | 13280 | 16737 | 30558 | 30556 | 637 | 15312 | 14670 | 16092 | 4415 | 9384 | |
| 23.7 | 637 | 48 | 27697 | 28045 | 11010 | 14380 | 13396 | 16817 | 30561 | 30560 | 637 | 15312 | 14648 | 16084 | 4462 | 9540 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59.7 | 26.2 | 637 | 48 | 27959 | 28170 | 11356 | 14511 | 13565 | 16080 | 30561 | 30559 | 637 | 15312 | 14396 | 16066 | 5199 | 9652 |
| 26.7 | 637 | 48 | 28019 | 28199 | 11082 | 14412 | 13420 | 15878 | 30563 | 30560 | 637 | 15312 | 14375 | 16072 | 5204 | 9640 | |
| 27.2 | 637 | 48 | 28033 | 28129 | 11387 | 14400 | 13519 | 15858 | 30551 | 30563 | 637 | 15312 | 14507 | 16078 | 5318 | 9743 | |
| 27.7 | 637 | 48 | 28156 | 28283 | 10929 | 14201 | 13348 | 15712 | 30555 | 30556 | 637 | 15312 | 14401 | 16101 | 5466 | 9935 | |
| 28.2 | 637 | 48 | 28495 | 28373 | 11102 | 14822 | 13779 | 16164 | 30542 | 30553 | 637 | 15312 | 14325 | 16128 | 5729 | 10107 |
112 rows × 16 columns
There are 112 unique combinations of latitude and longitude (unique weather stations).
So for each forecast_datetime there are 112 observations (one from each station).
display(forecast_weather.hours_ahead.nunique())
display(forecast_weather.origin_datetime.nunique())
display(forecast_weather.groupby('origin_datetime').forecast_datetime.size()/112)
48
637
origin_datetime
2021-09-01 00:00:00+00:00 48.0
2021-09-02 00:00:00+00:00 48.0
2021-09-03 00:00:00+00:00 48.0
2021-09-04 00:00:00+00:00 48.0
2021-09-05 00:00:00+00:00 48.0
...
2023-05-26 00:00:00+00:00 48.0
2023-05-27 00:00:00+00:00 48.0
2023-05-28 00:00:00+00:00 48.0
2023-05-29 00:00:00+00:00 48.0
2023-05-30 00:00:00+00:00 48.0
Name: forecast_datetime, Length: 637, dtype: float64
forecast_weather.groupby(['origin_datetime', 'forecast_datetime']).count()
| latitude | longitude | hours_ahead | temperature | dewpoint | cloudcover_high | cloudcover_low | cloudcover_mid | cloudcover_total | 10_metre_u_wind_component | 10_metre_v_wind_component | data_block_id | direct_solar_radiation | surface_solar_radiation_downwards | snowfall | total_precipitation | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| origin_datetime | forecast_datetime | ||||||||||||||||
| 2021-09-01 00:00:00+00:00 | 2021-09-01 01:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 |
| 2021-09-01 02:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2021-09-01 03:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2021-09-01 04:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2021-09-01 05:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2023-05-30 00:00:00+00:00 | 2023-05-31 20:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 |
| 2023-05-31 21:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2023-05-31 22:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2023-05-31 23:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | |
| 2023-06-01 00:00:00+00:00 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 | 112 |
30576 rows × 16 columns
Gas Prices Data¶
gas_prices = pd.read_csv('../data/gas_prices.csv')
gas_prices.head()
| forecast_date | lowest_price_per_mwh | highest_price_per_mwh | origin_date | data_block_id | |
|---|---|---|---|---|---|
| 0 | 2021-09-01 | 45.23 | 46.32 | 2021-08-31 | 1 |
| 1 | 2021-09-02 | 45.62 | 46.29 | 2021-09-01 | 2 |
| 2 | 2021-09-03 | 45.85 | 46.40 | 2021-09-02 | 3 |
| 3 | 2021-09-04 | 46.30 | 46.80 | 2021-09-03 | 4 |
| 4 | 2021-09-05 | 46.30 | 46.58 | 2021-09-04 | 5 |
gas_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   forecast_date          637 non-null    object
 1   lowest_price_per_mwh   637 non-null    float64
 2   highest_price_per_mwh  637 non-null    float64
 3   origin_date            637 non-null    object
 4   data_block_id          637 non-null    int64
dtypes: float64(2), int64(1), object(2)
memory usage: 25.0+ KB
gas_prices.forecast_date = pd.to_datetime(gas_prices.forecast_date)
gas_prices.origin_date = pd.to_datetime(gas_prices.origin_date)
gas_prices.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 637 entries, 0 to 636
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   forecast_date          637 non-null    datetime64[ns]
 1   lowest_price_per_mwh   637 non-null    float64
 2   highest_price_per_mwh  637 non-null    float64
 3   origin_date            637 non-null    datetime64[ns]
 4   data_block_id          637 non-null    int64
dtypes: datetime64[ns](2), float64(2), int64(1)
memory usage: 25.0 KB
gas_prices.describe()
| forecast_date | lowest_price_per_mwh | highest_price_per_mwh | origin_date | data_block_id | |
|---|---|---|---|---|---|
| count | 637 | 637.000000 | 637.000000 | 637 | 637.000000 |
| mean | 2022-07-16 00:00:00 | 95.036750 | 107.754631 | 2022-07-15 00:00:00 | 319.000000 |
| min | 2021-09-01 00:00:00 | 28.100000 | 34.000000 | 2021-08-31 00:00:00 | 1.000000 |
| 25% | 2022-02-07 00:00:00 | 60.000000 | 67.530000 | 2022-02-06 00:00:00 | 160.000000 |
| 50% | 2022-07-16 00:00:00 | 85.210000 | 93.470000 | 2022-07-15 00:00:00 | 319.000000 |
| 75% | 2022-12-22 00:00:00 | 109.000000 | 130.740000 | 2022-12-21 00:00:00 | 478.000000 |
| max | 2023-05-30 00:00:00 | 250.000000 | 305.000000 | 2023-05-29 00:00:00 | 637.000000 |
| std | NaN | 47.552295 | 54.743666 | NaN | 184.030342 |
Historical Weather Data¶
historical_weather = pd.read_csv('../data/historical_weather.csv')
historical_weather.head()
| datetime | temperature | dewpoint | rain | snowfall | surface_pressure | cloudcover_total | cloudcover_low | cloudcover_mid | cloudcover_high | windspeed_10m | winddirection_10m | shortwave_radiation | direct_solar_radiation | diffuse_radiation | latitude | longitude | data_block_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-01 00:00:00 | 14.4 | 12.0 | 0.0 | 0.0 | 1015.8 | 4 | 4 | 0 | 0 | 6.694444 | 3 | 0.0 | 0.0 | 0.0 | 57.6 | 21.7 | 1 |
| 1 | 2021-09-01 00:00:00 | 14.0 | 12.0 | 0.0 | 0.0 | 1010.6 | 7 | 8 | 0 | 0 | 4.944444 | 353 | 0.0 | 0.0 | 0.0 | 57.6 | 22.2 | 1 |
| 2 | 2021-09-01 00:00:00 | 14.4 | 12.8 | 0.0 | 0.0 | 1014.9 | 6 | 7 | 0 | 0 | 5.833333 | 348 | 0.0 | 0.0 | 0.0 | 57.6 | 22.7 | 1 |
| 3 | 2021-09-01 00:00:00 | 15.4 | 13.0 | 0.0 | 0.0 | 1014.4 | 4 | 2 | 4 | 0 | 7.111111 | 349 | 0.0 | 0.0 | 0.0 | 57.6 | 23.2 | 1 |
| 4 | 2021-09-01 00:00:00 | 15.9 | 12.6 | 0.0 | 0.0 | 1013.8 | 12 | 7 | 0 | 20 | 8.388889 | 360 | 0.0 | 0.0 | 0.0 | 57.6 | 23.7 | 1 |
historical_weather.datetime = pd.to_datetime(historical_weather.datetime)
historical_weather.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1710800 entries, 0 to 1710799
Data columns (total 18 columns):
 #   Column                  Dtype
---  ------                  -----
 0   datetime                datetime64[ns]
 1   temperature             float64
 2   dewpoint                float64
 3   rain                    float64
 4   snowfall                float64
 5   surface_pressure        float64
 6   cloudcover_total        int64
 7   cloudcover_low          int64
 8   cloudcover_mid          int64
 9   cloudcover_high         int64
 10  windspeed_10m           float64
 11  winddirection_10m       int64
 12  shortwave_radiation     float64
 13  direct_solar_radiation  float64
 14  diffuse_radiation       float64
 15  latitude                float64
 16  longitude               float64
 17  data_block_id           int64
dtypes: datetime64[ns](1), float64(11), int64(6)
memory usage: 234.9 MB
historical_weather.isnull().sum()
datetime                  0
temperature               0
dewpoint                  0
rain                      0
snowfall                  0
surface_pressure          0
cloudcover_total          0
cloudcover_low            0
cloudcover_mid            0
cloudcover_high           0
windspeed_10m             0
winddirection_10m         0
shortwave_radiation       0
direct_solar_radiation    0
diffuse_radiation         0
latitude                  0
longitude                 0
data_block_id             0
dtype: int64
Train Data & Checking for NULL values¶
train = pd.read_csv('../data/train.csv')
train.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 00:00:00 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 00:00:00 | 0 | 1 | 0 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 00:00:00 | 0 | 2 | 1 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 00:00:00 | 0 | 3 | 1 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 00:00:00 | 0 | 4 | 2 |
train.datetime = pd.to_datetime(train.datetime, format='%Y-%m-%d %H:%M:%S')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2018352 entries, 0 to 2018351
Data columns (total 9 columns):
 #   Column              Dtype
---  ------              -----
 0   county              int64
 1   is_business         int64
 2   product_type        int64
 3   target              float64
 4   is_consumption      int64
 5   datetime            datetime64[ns]
 6   data_block_id       int64
 7   row_id              int64
 8   prediction_unit_id  int64
dtypes: datetime64[ns](1), float64(1), int64(7)
memory usage: 138.6 MB
train.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 |
train.datetime[1000]
Timestamp('2021-09-01 08:00:00')
train.isnull().sum()
county                  0
is_business             0
product_type            0
target                528
is_consumption          0
datetime                0
data_block_id           0
row_id                  0
prediction_unit_id      0
dtype: int64
train.groupby('is_consumption').agg({'target': lambda x: x.isnull().sum()})
| target | |
|---|---|
| is_consumption | |
| 0 | 264 |
| 1 | 264 |
train[train.target.isnull()]
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 178938 | 0 | 0 | 1 | NaN | 0 | 2021-10-31 03:00:00 | 60 | 178938 | 0 |
| 178939 | 0 | 0 | 1 | NaN | 1 | 2021-10-31 03:00:00 | 60 | 178939 | 0 |
| 178940 | 0 | 0 | 2 | NaN | 0 | 2021-10-31 03:00:00 | 60 | 178940 | 1 |
| 178941 | 0 | 0 | 2 | NaN | 1 | 2021-10-31 03:00:00 | 60 | 178941 | 1 |
| 178942 | 0 | 0 | 3 | NaN | 0 | 2021-10-31 03:00:00 | 60 | 178942 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1806379 | 15 | 1 | 0 | NaN | 1 | 2023-03-26 03:00:00 | 571 | 1806379 | 64 |
| 1806380 | 15 | 1 | 1 | NaN | 0 | 2023-03-26 03:00:00 | 571 | 1806380 | 59 |
| 1806381 | 15 | 1 | 1 | NaN | 1 | 2023-03-26 03:00:00 | 571 | 1806381 | 59 |
| 1806382 | 15 | 1 | 3 | NaN | 0 | 2023-03-26 03:00:00 | 571 | 1806382 | 60 |
| 1806383 | 15 | 1 | 3 | NaN | 1 | 2023-03-26 03:00:00 | 571 | 1806383 | 60 |
528 rows × 9 columns
There seems to be no pattern in the null values of the target column (528 rows, split evenly between consumption and production). This suggests we can safely drop these rows.
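Sketch of the drop suggested above, on a toy frame (on the real data: `train = train.dropna(subset=["target"])`):

```python
import pandas as pd
import numpy as np

# Toy frame with one null target, standing in for train
toy = pd.DataFrame({"county": [0, 0, 15], "target": [0.713, np.nan, 196.24]})
cleaned = toy.dropna(subset=["target"]).reset_index(drop=True)
```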
weather_station_to_county_mapping = pd.read_csv('../data/weather_station_to_county_mapping.csv')
weather_station_to_county_mapping.head()
| county_name | longitude | latitude | county | |
|---|---|---|---|---|
| 0 | NaN | 21.7 | 57.6 | NaN |
| 1 | NaN | 21.7 | 57.9 | NaN |
| 2 | NaN | 21.7 | 58.2 | NaN |
| 3 | NaN | 21.7 | 58.5 | NaN |
| 4 | NaN | 21.7 | 58.8 | NaN |
weather_station_to_county_mapping.county.unique()
array([nan, 10., 1., 7., 6., 9., 0., 14., 3., 13., 11., 4., 5.,
15., 8., 2.])
weather_station_to_county_mapping.county_name.unique()
array([nan, 'Saaremaa', 'Hiiumaa', 'Pärnumaa', 'Läänemaa', 'Raplamaa',
'Harjumaa', 'Viljandimaa', 'Järvamaa', 'Valgamaa', 'Tartumaa',
'Jõgevamaa', 'Lääne-Virumaa', 'Võrumaa', 'Põlvamaa', 'Ida-Virumaa'],
dtype=object)
See https://www.kaggle.com/code/fabiendaniel/mapping-locations-and-county-codes/notebook for the county codes. There, the 'maa' suffix is removed from the county names, but is this really needed?
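If matching against an external county list ever does require dropping the suffix, `Series.str.removesuffix` (pandas >= 1.4) does it without touching other characters:

```python
import pandas as pd

# A few county names from the mapping above
names = pd.Series(["Saaremaa", "Hiiumaa", "Ida-Virumaa"])
stripped = names.str.removesuffix("maa")
```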
Data Merging (now we merge everything to train)¶
train.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 |
Merge Client¶
client.head()
| product_type | county | eic_count | installed_capacity | is_business | date | data_block_id | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 108 | 952.89 | 0 | 2021-09-01 | 2 |
| 1 | 2 | 0 | 17 | 166.40 | 0 | 2021-09-01 | 2 |
| 2 | 3 | 0 | 688 | 7207.88 | 0 | 2021-09-01 | 2 |
| 3 | 0 | 0 | 5 | 400.00 | 1 | 2021-09-01 | 2 |
| 4 | 1 | 0 | 43 | 1411.00 | 1 | 2021-09-01 | 2 |
print(client.size, train.size)
293433 18165168
# append '_client' to merged columns
client.columns = [f"{column}_client" if column not in ['data_block_id', 'county', 'is_business', 'product_type'] else column for column in client.columns]
# merge train and client
merged_df = pd.merge(train, client, on=['data_block_id', 'county', 'is_business', 'product_type'], how='left')
merged_df.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | date_client | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 | NaN | NaN | NaT |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 | NaN | NaN | NaT |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 | NaN | NaN | NaT |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 | NaN | NaN | NaT |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 | NaN | NaN | NaT |
There are quite a few null values, especially at the beginning and end of the period.
# how many eic counts per data_block_id?
merged_df[merged_df.eic_count_client.isnull()].data_block_id.value_counts()
data_block_id
0      2928
1      2928
30      144
31      144
122     144
123     144
154      96
607      96
153      96
584      48
540      48
533      48
541      48
572      48
573      48
580      48
583      48
596      48
590      48
591      48
595      48
519      48
602      48
603      48
606      48
608      48
611      48
532      48
458      48
518      48
514      48
91       48
92       48
172      48
173      48
203      48
204      48
215      48
216      48
273      48
274      48
365      48
366      48
395      48
396      48
426      48
427      48
513      48
612      48
Name: count, dtype: int64
# do dates in train and client overlap?
print(set(client.date_client.dt.date) ^ set(train.datetime.dt.date))
print(set(train.data_block_id) ^ set(client.data_block_id))
{datetime.date(2023, 5, 30), datetime.date(2023, 5, 31)}
{0, 1}
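The symmetric differences say data_block_id 0 and 1 have no client counterpart, and the last two client dates have no train rows. A sketch of the corresponding filter on a toy frame (on the real data: `train[~train.data_block_id.isin([0, 1])]`):

```python
import pandas as pd

# Toy frame covering the first few blocks
toy = pd.DataFrame({"data_block_id": [0, 1, 2, 3],
                    "target": [1.0, 2.0, 3.0, 4.0]})
trimmed = toy[~toy.data_block_id.isin([0, 1])]
```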
def print_block(dbid):
display("TRAIN", train[train['data_block_id'] == dbid])
# display("FORC WEATHER", forecast_weather[forecast_weather['data_block_id'] == dbid])
display("CLIENT", client[client['data_block_id'] == dbid])
# display("HIST WEATHER", historical_weather[historical_weather['data_block_id'] == dbid])
# display("E PRICES", electricity_prices[electricity_prices['data_block_id'] == dbid])
# display("G PRICES", gas_prices[gas_prices['data_block_id'] == dbid])
print_block(10)
'TRAIN'
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 29280 | 0 | 0 | 1 | 0.000 | 0 | 2021-09-11 00:00:00 | 10 | 29280 | 0 |
| 29281 | 0 | 0 | 1 | 109.095 | 1 | 2021-09-11 00:00:00 | 10 | 29281 | 0 |
| 29282 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-11 00:00:00 | 10 | 29282 | 1 |
| 29283 | 0 | 0 | 2 | 17.503 | 1 | 2021-09-11 00:00:00 | 10 | 29283 | 1 |
| 29284 | 0 | 0 | 3 | 0.077 | 0 | 2021-09-11 00:00:00 | 10 | 29284 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32203 | 15 | 0 | 3 | 76.522 | 1 | 2021-09-11 23:00:00 | 10 | 32203 | 58 |
| 32204 | 15 | 1 | 1 | 0.100 | 0 | 2021-09-11 23:00:00 | 10 | 32204 | 59 |
| 32205 | 15 | 1 | 1 | 17.543 | 1 | 2021-09-11 23:00:00 | 10 | 32205 | 59 |
| 32206 | 15 | 1 | 3 | 0.000 | 0 | 2021-09-11 23:00:00 | 10 | 32206 | 60 |
| 32207 | 15 | 1 | 3 | 292.390 | 1 | 2021-09-11 23:00:00 | 10 | 32207 | 60 |
2928 rows × 9 columns
'CLIENT'
| product_type | county | eic_count_client | installed_capacity_client | is_business | date_client | data_block_id | |
|---|---|---|---|---|---|---|---|
| 488 | 1 | 0 | 108 | 952.89 | 0 | 2021-09-09 | 10 |
| 489 | 2 | 0 | 17 | 166.40 | 0 | 2021-09-09 | 10 |
| 490 | 3 | 0 | 687 | 7199.88 | 0 | 2021-09-09 | 10 |
| 491 | 0 | 0 | 5 | 400.00 | 1 | 2021-09-09 | 10 |
| 492 | 1 | 0 | 43 | 1411.00 | 1 | 2021-09-09 | 10 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 544 | 3 | 14 | 67 | 3114.60 | 1 | 2021-09-09 | 10 |
| 545 | 1 | 15 | 10 | 83.20 | 0 | 2021-09-09 | 10 |
| 546 | 3 | 15 | 61 | 918.20 | 0 | 2021-09-09 | 10 |
| 547 | 1 | 15 | 7 | 325.00 | 1 | 2021-09-09 | 10 |
| 548 | 3 | 15 | 49 | 1778.70 | 1 | 2021-09-09 | 10 |
61 rows × 7 columns
There is a problem with NULL values after merging. One source is the start and end dates, but we don't know what is happening in between and whether it is problematic. Maybe some client data is sporadically missing?
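One way to probe whether the remaining gaps are sporadic or clustered: count the rows with a missing client value per month. Toy frame standing in for merged_df:

```python
import pandas as pd
import numpy as np

# Toy merged frame with two gaps in different months
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-09-01", "2021-09-15", "2022-03-26"]),
    "eic_count_client": [np.nan, 1.0, np.nan],
})
sub = toy[toy.eic_count_client.isna()]
gaps_per_month = sub.groupby(sub.datetime.dt.to_period("M")).size()
```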
Merge Gas Prices¶
gas_prices.head()
| forecast_date | lowest_price_per_mwh | highest_price_per_mwh | origin_date | data_block_id | |
|---|---|---|---|---|---|
| 0 | 2021-09-01 | 45.23 | 46.32 | 2021-08-31 | 1 |
| 1 | 2021-09-02 | 45.62 | 46.29 | 2021-09-01 | 2 |
| 2 | 2021-09-03 | 45.85 | 46.40 | 2021-09-02 | 3 |
| 3 | 2021-09-04 | 46.30 | 46.80 | 2021-09-03 | 4 |
| 4 | 2021-09-05 | 46.30 | 46.58 | 2021-09-04 | 5 |
# append _gas_prices to columns
gas_prices.columns = [f"{column}_gas_prices" if column != 'data_block_id' else column for column in gas_prices.columns]
# merge gas_prices
merged_df = pd.merge(merged_df, gas_prices, on=['data_block_id'], how='left')
merged_df.tail()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | date_client | forecast_date_gas_prices | lowest_price_per_mwh_gas_prices | highest_price_per_mwh_gas_prices | origin_date_gas_prices | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018347 | 15 | 1 | 0 | 197.233 | 1 | 2023-05-31 23:00:00 | 637 | 2018347 | 64 | 15.0 | 620.0 | 2023-05-29 | 2023-05-30 | 29.0 | 34.0 | 2023-05-29 |
| 2018348 | 15 | 1 | 1 | 0.000 | 0 | 2023-05-31 23:00:00 | 637 | 2018348 | 59 | 20.0 | 624.5 | 2023-05-29 | 2023-05-30 | 29.0 | 34.0 | 2023-05-29 |
| 2018349 | 15 | 1 | 1 | 28.404 | 1 | 2023-05-31 23:00:00 | 637 | 2018349 | 59 | 20.0 | 624.5 | 2023-05-29 | 2023-05-30 | 29.0 | 34.0 | 2023-05-29 |
| 2018350 | 15 | 1 | 3 | 0.000 | 0 | 2023-05-31 23:00:00 | 637 | 2018350 | 60 | 55.0 | 2188.2 | 2023-05-29 | 2023-05-30 | 29.0 | 34.0 | 2023-05-29 |
| 2018351 | 15 | 1 | 3 | 196.240 | 1 | 2023-05-31 23:00:00 | 637 | 2018351 | 60 | 55.0 | 2188.2 | 2023-05-29 | 2023-05-30 | 29.0 | 34.0 | 2023-05-29 |
Merge Electricity Prices¶
# add time column for merging with electricity data
merged_df['time_of_day'] = merged_df['datetime'].dt.time
merged_df.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | date_client | forecast_date_gas_prices | lowest_price_per_mwh_gas_prices | highest_price_per_mwh_gas_prices | origin_date_gas_prices | time_of_day | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 |
# Merge electricity prices
# the prices are available hourly -> create new column with time
electricity_prices['time_of_day'] = electricity_prices.forecast_date.dt.time
# append electricity_prices to column names
electricity_prices.columns = [f"{column}_electricity_prices" if column not in ['time_of_day','data_block_id'] else column for column in electricity_prices.columns]
# merge electricity_prices
merged_df = pd.merge(merged_df, electricity_prices, on = ['data_block_id', 'time_of_day'], how='left')
merged_df.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | date_client | forecast_date_gas_prices | lowest_price_per_mwh_gas_prices | highest_price_per_mwh_gas_prices | origin_date_gas_prices | time_of_day | forecast_date_electricity_prices | euros_per_mwh_electricity_prices | origin_date_electricity_prices | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 | NaT | NaN | NaT |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 | NaT | NaN | NaT |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 | NaT | NaN | NaT |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 | NaT | NaN | NaT |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 | NaN | NaN | NaT | NaT | NaN | NaN | NaT | 00:00:00 | NaT | NaN | NaT |
Merge Historical Weather¶
# historical weather
historical_weather.head()
| datetime | temperature | dewpoint | rain | snowfall | surface_pressure | cloudcover_total | cloudcover_low | cloudcover_mid | cloudcover_high | windspeed_10m | winddirection_10m | shortwave_radiation | direct_solar_radiation | diffuse_radiation | latitude | longitude | data_block_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-01 | 14.4 | 12.0 | 0.0 | 0.0 | 1015.8 | 4 | 4 | 0 | 0 | 6.694444 | 3 | 0.0 | 0.0 | 0.0 | 57.6 | 21.7 | 1 |
| 1 | 2021-09-01 | 14.0 | 12.0 | 0.0 | 0.0 | 1010.6 | 7 | 8 | 0 | 0 | 4.944444 | 353 | 0.0 | 0.0 | 0.0 | 57.6 | 22.2 | 1 |
| 2 | 2021-09-01 | 14.4 | 12.8 | 0.0 | 0.0 | 1014.9 | 6 | 7 | 0 | 0 | 5.833333 | 348 | 0.0 | 0.0 | 0.0 | 57.6 | 22.7 | 1 |
| 3 | 2021-09-01 | 15.4 | 13.0 | 0.0 | 0.0 | 1014.4 | 4 | 2 | 4 | 0 | 7.111111 | 349 | 0.0 | 0.0 | 0.0 | 57.6 | 23.2 | 1 |
| 4 | 2021-09-01 | 15.9 | 12.6 | 0.0 | 0.0 | 1013.8 | 12 | 7 | 0 | 20 | 8.388889 | 360 | 0.0 | 0.0 | 0.0 | 57.6 | 23.7 | 1 |
# get county and county_name from weather_station_to_county_mapping (merge on latitude and longitude)
# round lat and long to avoid mismatches caused by differing coordinate precision
historical_weather.latitude = historical_weather.latitude.round(1)
historical_weather.longitude = historical_weather.longitude.round(1)
weather_station_to_county_mapping.latitude = weather_station_to_county_mapping.latitude.round(1)
weather_station_to_county_mapping.longitude = weather_station_to_county_mapping.longitude.round(1)
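The rounding matters because floats that differ only in stored precision are unequal merge keys. A minimal sketch with made-up coordinates:

```python
import pandas as pd

# Hypothetical station coordinates recorded at different precision
weather = pd.DataFrame({'latitude': [57.60001], 'longitude': [21.69999], 'temp': [14.4]})
mapping = pd.DataFrame({'latitude': [57.6], 'longitude': [21.7], 'county': [3]})

# Without rounding the float keys differ, so the left join finds no county
raw = pd.merge(weather, mapping, on=['latitude', 'longitude'], how='left')

# Rounding both sides to one decimal makes the keys identical
weather[['latitude', 'longitude']] = weather[['latitude', 'longitude']].round(1)
matched = pd.merge(weather, mapping, on=['latitude', 'longitude'], how='left')
print(raw['county'].tolist(), matched['county'].tolist())  # [nan] [3]
```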
# merge historical weather to get counties
merged_hist_weather = pd.merge(historical_weather, weather_station_to_county_mapping, on=['latitude', 'longitude'], how='left')
# get time of day
merged_hist_weather['time_of_day'] = merged_hist_weather['datetime'].dt.time
# aggregate by county and time (average over weather stations in the same county)
merged_hist_weather = merged_hist_weather.groupby(['county', 'time_of_day', 'datetime', 'data_block_id']).mean(numeric_only=True).reset_index()
# suffix the column names with _hist_weather (except the merge keys)
merged_hist_weather.columns = [f"{column}_hist_weather" if column not in ['county', 'time_of_day','data_block_id'] else column for column in merged_hist_weather.columns]
merged_hist_weather.sample()
| county | time_of_day | datetime_hist_weather | data_block_id | temperature_hist_weather | dewpoint_hist_weather | rain_hist_weather | snowfall_hist_weather | surface_pressure_hist_weather | cloudcover_total_hist_weather | cloudcover_low_hist_weather | cloudcover_mid_hist_weather | cloudcover_high_hist_weather | windspeed_10m_hist_weather | winddirection_10m_hist_weather | shortwave_radiation_hist_weather | direct_solar_radiation_hist_weather | diffuse_radiation_hist_weather | latitude_hist_weather | longitude_hist_weather | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 223220 | 15.0 | 14:00:00 | 2022-11-30 14:00:00 | 457 | -2.78 | -4.76 | 0.0 | 0.056 | 1021.42 | 100.0 | 94.6 | 83.6 | 0.4 | 2.211111 | 313.4 | 68.2 | 1.4 | 66.8 | 57.78 | 27.1 |
# merge to merged_df
merged_df = pd.merge(merged_df, merged_hist_weather, on=['data_block_id', 'time_of_day', 'county'], how='left')
Merge Forecast Weather¶
forecast_weather.head()
| latitude | longitude | origin_datetime | hours_ahead | temperature | dewpoint | cloudcover_high | cloudcover_low | cloudcover_mid | cloudcover_total | 10_metre_u_wind_component | 10_metre_v_wind_component | data_block_id | forecast_datetime | direct_solar_radiation | surface_solar_radiation_downwards | snowfall | total_precipitation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57.6 | 21.7 | 2021-09-01 00:00:00+00:00 | 1 | 15.655786 | 11.553613 | 0.904816 | 0.019714 | 0.000000 | 0.905899 | -0.411328 | -9.106137 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 57.6 | 22.2 | 2021-09-01 00:00:00+00:00 | 1 | 13.003931 | 10.689844 | 0.886322 | 0.004456 | 0.000000 | 0.886658 | 0.206347 | -5.355405 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 57.6 | 22.7 | 2021-09-01 00:00:00+00:00 | 1 | 14.206567 | 11.671777 | 0.729034 | 0.005615 | 0.000000 | 0.730499 | 1.451587 | -7.417905 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 57.6 | 23.2 | 2021-09-01 00:00:00+00:00 | 1 | 14.844507 | 12.264917 | 0.336304 | 0.074341 | 0.000626 | 0.385468 | 1.090869 | -9.163999 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 57.6 | 23.7 | 2021-09-01 00:00:00+00:00 | 1 | 15.293848 | 12.458887 | 0.102875 | 0.088074 | 0.000015 | 0.176590 | 1.268481 | -8.975766 | 1 | 2021-09-01 01:00:00+00:00 | 0.0 | 0.0 | 0.0 | 0.0 |
# forecast weather
# round lat and long
forecast_weather.latitude = forecast_weather.latitude.round(1)
forecast_weather.longitude = forecast_weather.longitude.round(1)
# merge to get counties
merged_forecast_weather = pd.merge(forecast_weather, weather_station_to_county_mapping, on=['latitude', 'longitude'], how='left')
# aggregate for duplicate locations (average over weather stations in the same county)
merged_forecast_weather = merged_forecast_weather.groupby(['county', 'forecast_datetime', 'data_block_id']).mean(numeric_only=True).reset_index()
# suffix the column names with _forecast_weather (except the merge keys)
merged_forecast_weather.columns = [f"{column}_forecast_weather" if column not in ['county', 'forecast_datetime','data_block_id'] else column for column in merged_forecast_weather.columns]
merged_forecast_weather.sample()
| county | forecast_datetime | data_block_id | latitude_forecast_weather | longitude_forecast_weather | hours_ahead_forecast_weather | temperature_forecast_weather | dewpoint_forecast_weather | cloudcover_high_forecast_weather | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 337175 | 11.0 | 2021-09-19 00:00:00+00:00 | 18 | 58.35 | 26.7 | 24.0 | 6.82879 | 2.614771 | 0.309937 | 0.300819 | 0.0 | 0.506729 | -4.96759 | -2.603552 | 0.0 | 0.0 | 0.0 | 0.0 |
merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')[80000]
Timestamp('2021-09-28 07:00:00+0300', tz='EET')
merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')[1500000]
Timestamp('2022-12-20 07:00:00+0200', tz='EET')
merged_forecast_weather.forecast_datetime.dt.tz_convert('EET')[6000]
Timestamp('2022-01-04 15:00:00+0200', tz='EET')
# localize the naive datetimes to EET, handling daylight-saving transitions
merged_df['datetime_localized'] = merged_df.datetime.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')
# convert UTC timezone to EET timezone in forecast weather
merged_forecast_weather['datetime_EET'] = merged_forecast_weather.forecast_datetime.dt.tz_convert('EET')
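The distinction used here is easy to trip over: `tz_localize` attaches a zone to naive wall-clock timestamps (what `merged_df.datetime` needs), while `tz_convert` re-expresses already-aware timestamps in another zone (what the UTC `forecast_datetime` needs). A minimal check with a hypothetical summer timestamp:

```python
import pandas as pd

# Naive local wall-clock time, as stored in merged_df.datetime
local = pd.Series(pd.to_datetime(['2022-06-27 04:00:00']))
local_eet = local.dt.tz_localize('EET', ambiguous=True, nonexistent='shift_forward')

# UTC-aware time, as stored in forecast_weather.forecast_datetime
utc = pd.Series(pd.to_datetime(['2022-06-27 01:00:00+00:00']))
utc_eet = utc.dt.tz_convert('EET')

# In summer EET runs at UTC+3, so both describe the same instant
print(local_eet[0] == utc_eet[0])  # True
```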
merged_forecast_weather.query('data_block_id == 300')
| county | forecast_datetime | data_block_id | latitude_forecast_weather | longitude_forecast_weather | hours_ahead_forecast_weather | temperature_forecast_weather | dewpoint_forecast_weather | cloudcover_high_forecast_weather | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14329 | 0.0 | 2022-06-27 01:00:00+00:00 | 300 | 59.30 | 24.866667 | 1.0 | 17.726831 | 15.787215 | 0.000000 | 0.003357 | 0.008809 | 0.009867 | -1.264754 | 1.363535 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2022-06-27 04:00:00+03:00 |
| 14331 | 0.0 | 2022-06-27 02:00:00+00:00 | 300 | 59.30 | 24.866667 | 2.0 | 17.934391 | 16.112492 | 0.000000 | 0.003840 | 0.001831 | 0.005351 | -1.073307 | 1.067064 | 76.571599 | 9.877279 | 0.0 | 0.000000 | 2022-06-27 05:00:00+03:00 |
| 14333 | 0.0 | 2022-06-27 03:00:00+00:00 | 300 | 59.30 | 24.866667 | 3.0 | 19.487288 | 16.575301 | 0.000000 | 0.000000 | 0.000244 | 0.000244 | -1.185901 | 1.050166 | 289.486844 | 68.890875 | 0.0 | 0.000000 | 2022-06-27 06:00:00+03:00 |
| 14335 | 0.0 | 2022-06-27 04:00:00+00:00 | 300 | 59.30 | 24.866667 | 4.0 | 21.966781 | 17.163924 | 0.000000 | 0.000262 | 0.000280 | 0.000539 | -1.105384 | 0.556716 | 488.340764 | 166.954867 | 0.0 | 0.000000 | 2022-06-27 07:00:00+03:00 |
| 14337 | 0.0 | 2022-06-27 05:00:00+00:00 | 300 | 59.30 | 24.866667 | 5.0 | 24.450098 | 17.612248 | 0.000000 | 0.008311 | 0.000163 | 0.008382 | -1.017414 | 0.301280 | 621.634103 | 286.280498 | 0.0 | 0.000000 | 2022-06-27 08:00:00+03:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 442478 | 15.0 | 2022-06-28 20:00:00+00:00 | 300 | 57.78 | 27.100000 | 44.0 | 22.470117 | 19.469775 | 0.528113 | 0.015137 | 0.058371 | 0.556158 | -1.712890 | 0.529172 | 2.514667 | 0.056000 | 0.0 | 0.000000 | 2022-06-28 23:00:00+03:00 |
| 442480 | 15.0 | 2022-06-28 21:00:00+00:00 | 300 | 57.78 | 27.100000 | 45.0 | 21.575977 | 19.339404 | 0.370251 | 0.031049 | 0.076743 | 0.419739 | -1.573499 | 0.936665 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 2022-06-29 00:00:00+03:00 |
| 442482 | 15.0 | 2022-06-28 22:00:00+00:00 | 300 | 57.78 | 27.100000 | 46.0 | 20.878320 | 19.020068 | 0.509418 | 0.072824 | 0.216644 | 0.618463 | -1.143979 | 0.731405 | 0.000000 | 0.000000 | 0.0 | 0.000464 | 2022-06-29 01:00:00+03:00 |
| 442484 | 15.0 | 2022-06-28 23:00:00+00:00 | 300 | 57.78 | 27.100000 | 47.0 | 20.642871 | 18.551855 | 0.942194 | 0.114362 | 0.345221 | 0.961719 | -1.657865 | -0.578932 | 0.000000 | 0.000000 | 0.0 | 0.001295 | 2022-06-29 02:00:00+03:00 |
| 442486 | 15.0 | 2022-06-29 00:00:00+00:00 | 300 | 57.78 | 27.100000 | 48.0 | 19.568115 | 17.740234 | 0.954700 | 0.055295 | 0.261322 | 0.957941 | -2.176746 | 0.905358 | 0.000000 | 0.000000 | 0.0 | 0.001783 | 2022-06-29 03:00:00+03:00 |
720 rows × 19 columns
# merge forecast_weather
merged_df = pd.merge(merged_df, merged_forecast_weather, left_on=['data_block_id', 'datetime_localized', 'county'], right_on=['data_block_id', 'datetime_EET', 'county'], how='left')
merged_df.sample(10)
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | ... | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 667583 | 5 | 0 | 1 | 79.567 | 1 | 2022-04-06 06:00:00 | 217 | 667583 | 19 | 35.0 | ... | 0.149887 | 0.448029 | 0.499992 | 3.157317 | 2.655265 | 0.000000 | 0.000000 | 0.000181 | 0.000181 | 2022-04-06 06:00:00+03:00 |
| 1899807 | 15 | 1 | 3 | 277.765 | 1 | 2023-04-24 22:00:00 | 600 | 1899807 | 60 | 50.0 | ... | 0.034052 | 0.176721 | 0.201703 | -0.103642 | 3.252309 | -0.012444 | 0.000000 | 0.000000 | 0.000000 | 2023-04-24 22:00:00+03:00 |
| 1322989 | 5 | 1 | 1 | 115.854 | 1 | 2022-10-27 05:00:00 | 421 | 1322989 | 22 | 14.0 | ... | 0.999993 | 0.000008 | 0.999997 | 3.237482 | -1.332670 | 0.000000 | 0.000000 | 0.000000 | 0.000026 | 2022-10-27 05:00:00+03:00 |
| 1665732 | 8 | 1 | 3 | 0.023 | 0 | 2023-02-09 23:00:00 | 526 | 1665732 | 33 | 34.0 | ... | 0.001678 | 1.000000 | 1.000006 | 4.558232 | 4.837220 | 0.000000 | 0.000000 | 0.000004 | 0.000005 | 2023-02-09 23:00:00+02:00 |
| 337989 | 4 | 1 | 0 | 147.829 | 1 | 2021-12-22 17:00:00 | 112 | 337989 | 17 | 6.0 | ... | 1.000000 | 0.000000 | 1.000000 | 2.387765 | -2.303231 | 0.000000 | 0.000000 | 0.000016 | 0.000014 | 2021-12-22 17:00:00+02:00 |
| 637088 | 9 | 1 | 1 | 0.370 | 0 | 2022-03-27 19:00:00 | 207 | 637088 | 36 | 14.0 | ... | 0.996928 | 1.000005 | 1.000000 | 1.707896 | 2.316880 | 2.986667 | 42.040000 | 0.000056 | 0.000052 | 2022-03-27 19:00:00+03:00 |
| 118581 | 15 | 1 | 1 | 40.916 | 1 | 2021-10-11 03:00:00 | 40 | 118581 | 59 | 8.0 | ... | 0.000000 | 0.000601 | 0.628912 | 2.137020 | 2.982671 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2021-10-11 03:00:00+03:00 |
| 1275251 | 5 | 1 | 0 | 285.550 | 1 | 2022-10-12 14:00:00 | 406 | 1275251 | 21 | 7.0 | ... | 0.101276 | 0.143452 | 0.204136 | 3.546041 | -1.241343 | 694.192778 | 347.429722 | 0.000000 | 0.000047 | 2022-10-12 14:00:00+03:00 |
| 1298896 | 3 | 1 | 1 | 0.000 | 0 | 2022-10-19 20:00:00 | 413 | 1298896 | 13 | 12.0 | ... | 0.387380 | 0.422170 | 0.986809 | 2.940112 | -2.279040 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2022-10-19 20:00:00+03:00 |
| 1432208 | 7 | 1 | 3 | 0.319 | 0 | 2022-11-29 06:00:00 | 454 | 1432208 | 30 | 63.0 | ... | 1.000004 | 0.317584 | 1.000000 | -3.999664 | 1.836443 | 0.000000 | 0.000000 | 0.000007 | 0.000009 | 2022-11-29 06:00:00+02:00 |
10 rows × 55 columns
merged_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2018352 entries, 0 to 2018351 Data columns (total 55 columns): # Column Dtype --- ------ ----- 0 county int64 1 is_business int64 2 product_type int64 3 target float64 4 is_consumption int64 5 datetime datetime64[ns] 6 data_block_id int64 7 row_id int64 8 prediction_unit_id int64 9 eic_count_client float64 10 installed_capacity_client float64 11 date_client datetime64[ns] 12 forecast_date_gas_prices datetime64[ns] 13 lowest_price_per_mwh_gas_prices float64 14 highest_price_per_mwh_gas_prices float64 15 origin_date_gas_prices datetime64[ns] 16 time_of_day object 17 forecast_date_electricity_prices datetime64[ns] 18 euros_per_mwh_electricity_prices float64 19 origin_date_electricity_prices datetime64[ns] 20 datetime_hist_weather datetime64[ns] 21 temperature_hist_weather float64 22 dewpoint_hist_weather float64 23 rain_hist_weather float64 24 snowfall_hist_weather float64 25 surface_pressure_hist_weather float64 26 cloudcover_total_hist_weather float64 27 cloudcover_low_hist_weather float64 28 cloudcover_mid_hist_weather float64 29 cloudcover_high_hist_weather float64 30 windspeed_10m_hist_weather float64 31 winddirection_10m_hist_weather float64 32 shortwave_radiation_hist_weather float64 33 direct_solar_radiation_hist_weather float64 34 diffuse_radiation_hist_weather float64 35 latitude_hist_weather float64 36 longitude_hist_weather float64 37 datetime_localized datetime64[ns, EET] 38 forecast_datetime datetime64[ns, UTC] 39 latitude_forecast_weather float64 40 longitude_forecast_weather float64 41 hours_ahead_forecast_weather float64 42 temperature_forecast_weather float64 43 dewpoint_forecast_weather float64 44 cloudcover_high_forecast_weather float64 45 cloudcover_low_forecast_weather float64 46 cloudcover_mid_forecast_weather float64 47 cloudcover_total_forecast_weather float64 48 10_metre_u_wind_component_forecast_weather float64 49 10_metre_v_wind_component_forecast_weather float64 50 direct_solar_radiation_forecast_weather float64 51 surface_solar_radiation_downwards_forecast_weather float64 52 snowfall_forecast_weather float64 53 total_precipitation_forecast_weather float64 54 datetime_EET datetime64[ns, EET] dtypes: datetime64[ns, EET](2), datetime64[ns, UTC](1), datetime64[ns](7), float64(37), int64(7), object(1) memory usage: 846.9+ MB
merged_df.size
111009360
Checking for NULL values on merged data¶
merged_df.isnull().sum()
county 0 is_business 0 product_type 0 target 528 is_consumption 0 datetime 0 data_block_id 0 row_id 0 prediction_unit_id 0 eic_count_client 8640 installed_capacity_client 8640 date_client 8640 forecast_date_gas_prices 2928 lowest_price_per_mwh_gas_prices 2928 highest_price_per_mwh_gas_prices 2928 origin_date_gas_prices 2928 time_of_day 0 forecast_date_electricity_prices 3196 euros_per_mwh_electricity_prices 3196 origin_date_electricity_prices 3196 datetime_hist_weather 35064 temperature_hist_weather 35064 dewpoint_hist_weather 35064 rain_hist_weather 35064 snowfall_hist_weather 35064 surface_pressure_hist_weather 35064 cloudcover_total_hist_weather 35064 cloudcover_low_hist_weather 35064 cloudcover_mid_hist_weather 35064 cloudcover_high_hist_weather 35064 windspeed_10m_hist_weather 35064 winddirection_10m_hist_weather 35064 shortwave_radiation_hist_weather 35064 direct_solar_radiation_hist_weather 35064 diffuse_radiation_hist_weather 35064 latitude_hist_weather 35064 longitude_hist_weather 35064 datetime_localized 0 forecast_datetime 33504 latitude_forecast_weather 33504 longitude_forecast_weather 33504 hours_ahead_forecast_weather 33504 temperature_forecast_weather 33504 dewpoint_forecast_weather 33504 cloudcover_high_forecast_weather 33504 cloudcover_low_forecast_weather 33504 cloudcover_mid_forecast_weather 33504 cloudcover_total_forecast_weather 33504 10_metre_u_wind_component_forecast_weather 33504 10_metre_v_wind_component_forecast_weather 33504 direct_solar_radiation_forecast_weather 33504 surface_solar_radiation_downwards_forecast_weather 33504 snowfall_forecast_weather 33504 total_precipitation_forecast_weather 33504 datetime_EET 33504 dtype: int64
merged_df.query('data_block_id != 0 and data_block_id != 1').isnull().sum()
county 0 is_business 0 product_type 0 target 528 is_consumption 0 datetime 0 data_block_id 0 row_id 0 prediction_unit_id 0 eic_count_client 2784 installed_capacity_client 2784 date_client 2784 forecast_date_gas_prices 0 lowest_price_per_mwh_gas_prices 0 highest_price_per_mwh_gas_prices 0 origin_date_gas_prices 0 time_of_day 0 forecast_date_electricity_prices 268 euros_per_mwh_electricity_prices 268 origin_date_electricity_prices 268 datetime_hist_weather 30528 temperature_hist_weather 30528 dewpoint_hist_weather 30528 rain_hist_weather 30528 snowfall_hist_weather 30528 surface_pressure_hist_weather 30528 cloudcover_total_hist_weather 30528 cloudcover_low_hist_weather 30528 cloudcover_mid_hist_weather 30528 cloudcover_high_hist_weather 30528 windspeed_10m_hist_weather 30528 winddirection_10m_hist_weather 30528 shortwave_radiation_hist_weather 30528 direct_solar_radiation_hist_weather 30528 diffuse_radiation_hist_weather 30528 latitude_hist_weather 30528 longitude_hist_weather 30528 datetime_localized 0 forecast_datetime 30528 latitude_forecast_weather 30528 longitude_forecast_weather 30528 hours_ahead_forecast_weather 30528 temperature_forecast_weather 30528 dewpoint_forecast_weather 30528 cloudcover_high_forecast_weather 30528 cloudcover_low_forecast_weather 30528 cloudcover_mid_forecast_weather 30528 cloudcover_total_forecast_weather 30528 10_metre_u_wind_component_forecast_weather 30528 10_metre_v_wind_component_forecast_weather 30528 direct_solar_radiation_forecast_weather 30528 surface_solar_radiation_downwards_forecast_weather 30528 snowfall_forecast_weather 30528 total_precipitation_forecast_weather 30528 datetime_EET 30528 dtype: int64
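The drop in price NaNs above (2928 to 0 for gas) illustrates the earlier conclusion: the lagged inputs simply do not exist for the first data blocks. The pattern in miniature, with hypothetical values:

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame: lagged features are missing only at the start
df = pd.DataFrame({'data_block_id': [0, 0, 1, 2, 2, 3],
                   'gas_price': [np.nan, np.nan, np.nan, 52.0, 52.0, 49.5]})

before = df['gas_price'].isnull().sum()
after = df.query('data_block_id > 1')['gas_price'].isnull().sum()
print(before, after)  # 3 0
```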
filtered = merged_df.query('data_block_id != 0 and data_block_id != 1')
filtered[filtered.isnull()].data_block_id
5856 NaN
5857 NaN
5858 NaN
5859 NaN
5860 NaN
..
2018347 NaN
2018348 NaN
2018349 NaN
2018350 NaN
2018351 NaN
Name: data_block_id, Length: 2012496, dtype: float64
#sns.scatterplot(data=merged_df, x='datetime', y= 'temperature_hist_weather', hue='county')
import missingno as msno
merged_df.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | ... | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
5 rows × 55 columns
msno.matrix(merged_df)
<Axes: >
msno.matrix(merged_df.query('data_block_id != 0 and data_block_id != 1'))
<Axes: >
msno.bar(merged_df)
<Axes: >
merged_df[(merged_df.temperature_hist_weather.isnull()) & (merged_df.data_block_id == 200)]
# .datetime.dt.date.unique()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | ... | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 612200 | 12 | 1 | 3 | 1.250 | 0 | 2022-03-20 00:00:00 | 200 | 612200 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612201 | 12 | 1 | 3 | 324.651 | 1 | 2022-03-20 00:00:00 | 200 | 612201 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612332 | 12 | 1 | 3 | 1.250 | 0 | 2022-03-20 01:00:00 | 200 | 612332 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612333 | 12 | 1 | 3 | 324.552 | 1 | 2022-03-20 01:00:00 | 200 | 612333 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612464 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 02:00:00 | 200 | 612464 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612465 | 12 | 1 | 3 | 321.995 | 1 | 2022-03-20 02:00:00 | 200 | 612465 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612596 | 12 | 1 | 3 | 1.250 | 0 | 2022-03-20 03:00:00 | 200 | 612596 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612597 | 12 | 1 | 3 | 321.500 | 1 | 2022-03-20 03:00:00 | 200 | 612597 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612728 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 04:00:00 | 200 | 612728 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612729 | 12 | 1 | 3 | 321.716 | 1 | 2022-03-20 04:00:00 | 200 | 612729 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612860 | 12 | 1 | 3 | 1.300 | 0 | 2022-03-20 05:00:00 | 200 | 612860 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612861 | 12 | 1 | 3 | 322.387 | 1 | 2022-03-20 05:00:00 | 200 | 612861 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612992 | 12 | 1 | 3 | 1.100 | 0 | 2022-03-20 06:00:00 | 200 | 612992 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 612993 | 12 | 1 | 3 | 311.407 | 1 | 2022-03-20 06:00:00 | 200 | 612993 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613124 | 12 | 1 | 3 | 1.153 | 0 | 2022-03-20 07:00:00 | 200 | 613124 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613125 | 12 | 1 | 3 | 278.844 | 1 | 2022-03-20 07:00:00 | 200 | 613125 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613256 | 12 | 1 | 3 | 18.506 | 0 | 2022-03-20 08:00:00 | 200 | 613256 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613257 | 12 | 1 | 3 | 212.506 | 1 | 2022-03-20 08:00:00 | 200 | 613257 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613388 | 12 | 1 | 3 | 57.796 | 0 | 2022-03-20 09:00:00 | 200 | 613388 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613389 | 12 | 1 | 3 | 185.578 | 1 | 2022-03-20 09:00:00 | 200 | 613389 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613520 | 12 | 1 | 3 | 97.859 | 0 | 2022-03-20 10:00:00 | 200 | 613520 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613521 | 12 | 1 | 3 | 189.307 | 1 | 2022-03-20 10:00:00 | 200 | 613521 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613652 | 12 | 1 | 3 | 113.126 | 0 | 2022-03-20 11:00:00 | 200 | 613652 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613653 | 12 | 1 | 3 | 189.486 | 1 | 2022-03-20 11:00:00 | 200 | 613653 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613784 | 12 | 1 | 3 | 115.212 | 0 | 2022-03-20 12:00:00 | 200 | 613784 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613785 | 12 | 1 | 3 | 192.305 | 1 | 2022-03-20 12:00:00 | 200 | 613785 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613916 | 12 | 1 | 3 | 99.160 | 0 | 2022-03-20 13:00:00 | 200 | 613916 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 613917 | 12 | 1 | 3 | 195.759 | 1 | 2022-03-20 13:00:00 | 200 | 613917 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614048 | 12 | 1 | 3 | 77.348 | 0 | 2022-03-20 14:00:00 | 200 | 614048 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614049 | 12 | 1 | 3 | 214.534 | 1 | 2022-03-20 14:00:00 | 200 | 614049 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614180 | 12 | 1 | 3 | 54.285 | 0 | 2022-03-20 15:00:00 | 200 | 614180 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614181 | 12 | 1 | 3 | 244.226 | 1 | 2022-03-20 15:00:00 | 200 | 614181 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614312 | 12 | 1 | 3 | 15.541 | 0 | 2022-03-20 16:00:00 | 200 | 614312 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614313 | 12 | 1 | 3 | 277.902 | 1 | 2022-03-20 16:00:00 | 200 | 614313 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614444 | 12 | 1 | 3 | 1.150 | 0 | 2022-03-20 17:00:00 | 200 | 614444 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614445 | 12 | 1 | 3 | 319.168 | 1 | 2022-03-20 17:00:00 | 200 | 614445 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614576 | 12 | 1 | 3 | 1.150 | 0 | 2022-03-20 18:00:00 | 200 | 614576 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614577 | 12 | 1 | 3 | 333.934 | 1 | 2022-03-20 18:00:00 | 200 | 614577 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614708 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 19:00:00 | 200 | 614708 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614709 | 12 | 1 | 3 | 341.857 | 1 | 2022-03-20 19:00:00 | 200 | 614709 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614840 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 20:00:00 | 200 | 614840 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614841 | 12 | 1 | 3 | 332.939 | 1 | 2022-03-20 20:00:00 | 200 | 614841 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614972 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 21:00:00 | 200 | 614972 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 614973 | 12 | 1 | 3 | 334.121 | 1 | 2022-03-20 21:00:00 | 200 | 614973 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 615104 | 12 | 1 | 3 | 1.250 | 0 | 2022-03-20 22:00:00 | 200 | 615104 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 615105 | 12 | 1 | 3 | 320.538 | 1 | 2022-03-20 22:00:00 | 200 | 615105 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 615236 | 12 | 1 | 3 | 1.200 | 0 | 2022-03-20 23:00:00 | 200 | 615236 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 615237 | 12 | 1 | 3 | 324.819 | 1 | 2022-03-20 23:00:00 | 200 | 615237 | 49 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
48 rows × 55 columns
merged_df.groupby('county').aggregate(lambda x: x.isnull().sum())
| is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | ... | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| county | |||||||||||||||||||||
| 0 | 0 | 0 | 56 | 0 | 0 | 0 | 0 | 0 | 672 | 672 | ... | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 |
| 1 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0 | 288 | 288 | ... | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| 2 | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 0 | 384 | 384 | ... | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 | 96 |
| 3 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 384 | 384 | ... | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 |
| 4 | 0 | 0 | 38 | 0 | 0 | 0 | 0 | 0 | 480 | 480 | ... | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 |
| 5 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 576 | 576 | ... | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 |
| 6 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 96 | 96 | ... | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 | 48 |
| 7 | 0 | 0 | 46 | 0 | 0 | 0 | 0 | 0 | 1056 | 1056 | ... | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 |
| 8 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0 | 288 | 288 | ... | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| 9 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 384 | 384 | ... | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 |
| 10 | 0 | 0 | 36 | 0 | 0 | 0 | 0 | 0 | 912 | 912 | ... | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 | 240 |
| 11 | 0 | 0 | 52 | 0 | 0 | 0 | 0 | 0 | 1392 | 1392 | ... | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 | 288 |
| 12 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 0 | 96 | 96 | ... | 30624 | 30624 | 30624 | 30624 | 30624 | 30624 | 30624 | 30624 | 30624 | 30624 |
| 13 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 384 | 384 | ... | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 | 144 |
| 14 | 0 | 0 | 32 | 0 | 0 | 0 | 0 | 0 | 768 | 768 | ... | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 |
| 15 | 0 | 0 | 38 | 0 | 0 | 0 | 0 | 0 | 480 | 480 | ... | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 | 192 |
16 rows × 54 columns
train.query('county == 11').shape
(198000, 9)
weather_station_to_county_mapping.query('county == 12')
| county_name | longitude | latitude | county |
|---|---|---|---|
0 rows × 4 columns
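Because county 12 has no row in the station mapping, every later left join on `county` leaves its weather columns NaN, matching the 30624 missing forecast values seen for county 12 above. The mechanism, reproduced with hypothetical frames:

```python
import pandas as pd

# Hypothetical mapping that, like the real one, lacks county 12
weather = pd.DataFrame({'county': [0, 11], 'temperature': [14.4, 13.9]})
targets = pd.DataFrame({'county': [0, 11, 12], 'target': [0.7, 96.6, 1.25]})

merged = pd.merge(targets, weather, on='county', how='left')
# County 12 finds no partner row, so its weather stays NaN
print(merged.loc[merged.county == 12, 'temperature'].isnull().all())  # True
```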
EDA¶
import seaborn as sns
# subsample for faster plotting
subsampled_df = merged_df.sample(10000)
#sns.scatterplot(data=merged_df, x='datetime', y = 'target', hue='is_consumption')
sns.scatterplot(data=subsampled_df, x='datetime', y = 'target', hue='is_consumption')
<Axes: xlabel='datetime', ylabel='target'>
sns.kdeplot(data=subsampled_df, x = 'target', hue='is_consumption')
<Axes: xlabel='target', ylabel='Density'>
# subset into production df
production_df = merged_df.query('is_consumption == 0').groupby('datetime').mean(numeric_only=True)
# we should aggregate target by sum, weather variables by mean
production_df.head()
| county | is_business | product_type | target | is_consumption | data_block_id | row_id | prediction_unit_id | eic_count_client | installed_capacity_client | ... | cloudcover_high_forecast_weather | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | |||||||||||||||||||||
| 2021-09-01 00:00:00 | 7.393443 | 0.508197 | 2.0 | 0.071508 | 0.0 | 0.0 | 60.0 | 30.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-09-01 01:00:00 | 7.393443 | 0.508197 | 2.0 | 0.194295 | 0.0 | 0.0 | 182.0 | 30.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-09-01 02:00:00 | 7.393443 | 0.508197 | 2.0 | 0.030246 | 0.0 | 0.0 | 304.0 | 30.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-09-01 03:00:00 | 7.393443 | 0.508197 | 2.0 | 0.024869 | 0.0 | 0.0 | 426.0 | 30.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2021-09-01 04:00:00 | 7.393443 | 0.508197 | 2.0 | 0.040852 | 0.0 | 0.0 | 548.0 | 30.0 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 44 columns
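The comment above notes that target should be aggregated by sum while weather variables are averaged; `groupby().agg()` with named aggregations handles such mixed rules in one call. A toy sketch with invented values:

```python
import pandas as pd

toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2021-09-01 00:00'] * 2 + ['2021-09-01 01:00'] * 2),
    'target': [1.0, 2.0, 3.0, 4.0],
    'cloudcover_total_forecast_weather': [0.2, 0.4, 0.6, 0.8],
})
# sum the target per timestamp, but average the weather column
agg = toy.groupby('datetime').agg(
    target=('target', 'sum'),
    cloudcover=('cloudcover_total_forecast_weather', 'mean'),
)
print(agg['target'].tolist())  # [3.0, 7.0]
print(agg['cloudcover'].tolist())
```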
production_df.size
673728
sns.scatterplot(data=production_df, x = 'cloudcover_total_forecast_weather', y = 'target')
<Axes: xlabel='cloudcover_total_forecast_weather', ylabel='target'>
sns.scatterplot(data=production_df, x = 'direct_solar_radiation_forecast_weather', y = 'target')
<Axes: xlabel='direct_solar_radiation_forecast_weather', ylabel='target'>
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='is_business')
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
production_df = merged_df.query('is_consumption == 0').sample(100000)
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='product_type')
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
- Surface solar radiation seems to have a stronger correlation with target than direct_solar
- There seems to be a 'split' around 6000 (unit?) daily mean target
- We expected more businesses in the top-producers
client.describe()
| product_type | county | eic_count_client | installed_capacity_client | is_business | date_client | data_block_id | |
|---|---|---|---|---|---|---|---|
| count | 41919.000000 | 41919.000000 | 41919.000000 | 41919.000000 | 41919.000000 | 41919 | 41919.000000 |
| mean | 1.898996 | 7.297097 | 73.345118 | 1450.771451 | 0.536773 | 2022-07-18 21:34:22.921348096 | 322.898876 |
| min | 0.000000 | 0.000000 | 5.000000 | 5.500000 | 0.000000 | 2021-09-01 00:00:00 | 2.000000 |
| 25% | 1.000000 | 3.000000 | 13.000000 | 321.900000 | 0.000000 | 2022-02-13 00:00:00 | 167.000000 |
| 50% | 2.000000 | 7.000000 | 32.000000 | 645.200000 | 1.000000 | 2022-07-20 00:00:00 | 324.000000 |
| 75% | 3.000000 | 11.000000 | 70.000000 | 1567.150000 | 1.000000 | 2022-12-23 00:00:00 | 480.000000 |
| max | 3.000000 | 15.000000 | 1517.000000 | 19314.310000 | 1.000000 | 2023-05-29 00:00:00 | 637.000000 |
| std | 1.081713 | 4.780750 | 144.064389 | 2422.233120 | 0.498652 | NaN | 182.075724 |
Focusing on providers with installed capacity between 1200 and 1300 (is there a trend visible?)¶
production_df = merged_df.query('(is_consumption == 0) and (1200 < installed_capacity_client < 1300)')
sns.scatterplot(data=production_df, x = 'surface_solar_radiation_downwards_forecast_weather', y = 'target', hue='product_type')
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
merged_df.query('is_consumption == 0').groupby('product_type')['installed_capacity_client'].aggregate(['min', 'mean', 'max'])
| min | mean | max | |
|---|---|---|---|
| product_type | |||
| 0 | 260.0 | 920.328878 | 4968.600 |
| 1 | 60.0 | 630.766002 | 5250.705 |
| 2 | 6.0 | 88.098622 | 303.000 |
| 3 | 165.0 | 2465.757342 | 19314.310 |
- product_type 2 attracts producers with small installed capacity and thus low production
- product_type 3 attracts the opposite cluster: producers with a lot of installed_capacity
Further quick Visualizations¶
merged_df.plot(x='datetime', y='euros_per_mwh_electricity_prices')
plt.title('euros_per_mwh_electricity_prices')
plt.show()
plt.plot(merged_df['datetime'], merged_df['lowest_price_per_mwh_gas_prices'], label = 'lowest price')
plt.plot(merged_df['datetime'], merged_df['highest_price_per_mwh_gas_prices'], label = 'highest price')
plt.title('highest and lowest price_per_mwh_gas_prices')
plt.legend()
plt.show()
merged_df.plot(x='datetime', y='temperature_hist_weather')
plt.title('temperature_hist_weather')
plt.show()
merged_df['dewpoint_hist_weather'].plot()
plt.title('dewpoint_hist_weather')
plt.show()
merged_df['rain_hist_weather'].plot()
plt.title('rain_hist_weather')
plt.show()
merged_df['snowfall_hist_weather'].plot()
plt.title('snowfall_hist_weather')
plt.show()
merged_df['cloudcover_total_hist_weather'].plot()
plt.title('cloudcover_total_hist_weather')
plt.show()
merged_df[['shortwave_radiation_hist_weather','direct_solar_radiation_hist_weather', 'diffuse_radiation_hist_weather']].plot()
plt.title('solar radiation')
plt.show()
# Create a 1x3 grid of subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
# Plot the first graph
# note: merged_df mixes production and consumption rows, so these are means of the combined target
# Plot the first graph
merged_df.groupby('county')['target'].mean().plot(kind='bar', ax=axes[0])
axes[0].set_title('Average Target per County')
axes[0].set_xlabel('County')
axes[0].set_ylabel('Average Target')
# Plot the second graph
merged_df.groupby('product_type')['target'].mean().plot(kind='bar', ax=axes[1])
axes[1].set_title('Average Target per Product type')
axes[1].set_xlabel('Product type')
axes[1].set_ylabel('Average Target')
# Plot the third graph
merged_df.groupby('is_business')['target'].mean().plot(kind='bar', ax=axes[2])
axes[2].set_title('Average Target: Business vs. Non-business')
axes[2].set_xlabel('Business or not')
axes[2].set_ylabel('Average Target')
# Adjust layout to prevent clipping of titles
plt.tight_layout()
# Show the combined plot
plt.show()
def split_datetime(data, col="datetime"):
    # What columns are of type datetime?
    datetime_columns = data.select_dtypes(include='datetime64').columns
    for c in datetime_columns:
        print(f"Timezone for {c} is {data[c].dt.tz}")
    # Adding columns for date & time
    data['year'] = data[col].dt.year
    data['quarter'] = data[col].dt.quarter
    data['month'] = data[col].dt.month
    data['week'] = data[col].dt.isocalendar().week
    data['hour'] = data[col].dt.hour
    data['day_of_year'] = data[col].dt.day_of_year
    data['day_of_month'] = data[col].dt.day
    data['day_of_week'] = data[col].dt.day_of_week
    return data
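A quick sanity check of the datetime accessors used in `split_datetime`, on the last timestamp of the training period (2023-05-31 23:00 is a Wednesday in ISO week 22):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2023-05-31 23:00:00']))
print(ts.dt.year[0], ts.dt.quarter[0], int(ts.dt.isocalendar().week[0]),
      ts.dt.hour[0], ts.dt.day_of_week[0])  # 2023 2 22 23 2
```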
# calculate sum of production of business producers and the average over the day of the week
business_production_sum_timeseries = merged_df.query('is_consumption == 0 & is_business == 1')[["datetime", "target"]].groupby("datetime").sum()
business_production_sum_timeseries.reset_index(inplace=True)
business_production_sum_timeseries = split_datetime(business_production_sum_timeseries, "datetime")
# Have a look at the production average per day of the week
business_production_av_day_of_week = business_production_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
business_production_av_day_of_week.reset_index(inplace=True)
# calculate sum of production of non-business producers and the average over the day of the week
non_business_production_sum_timeseries = merged_df.query('is_consumption == 0 & is_business == 0')[["datetime", "target"]].groupby("datetime").sum()
non_business_production_sum_timeseries.reset_index(inplace=True)
non_business_production_sum_timeseries = split_datetime(non_business_production_sum_timeseries, "datetime")
# Have a look at the production average per day of the week
non_business_production_av_day_of_week = non_business_production_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
non_business_production_av_day_of_week.reset_index(inplace=True)
plt.figure(figsize=(12,8))
plt.plot(business_production_av_day_of_week.index, business_production_av_day_of_week["target"], label="business")
plt.plot(non_business_production_av_day_of_week.index, non_business_production_av_day_of_week["target"], label="non-business")
plt.legend()
Timezone for datetime is None Timezone for datetime is None
<matplotlib.legend.Legend at 0x332a20e50>
Production for businesses is higher on the weekends
# calculate sum of consumption of business clients and the average over the day of the week
business_consumption_sum_timeseries = merged_df.query('is_consumption == 1 & is_business == 1')[["datetime", "target"]].groupby("datetime").sum()
business_consumption_sum_timeseries.reset_index(inplace=True)
business_consumption_sum_timeseries = split_datetime(business_consumption_sum_timeseries, "datetime")
# Have a look at the consumption average per day of the week
business_consumption_av_day_of_week = business_consumption_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
business_consumption_av_day_of_week.reset_index(inplace=True)
# calculate sum of consumption of non-business clients and the average over the day of the week
non_business_consumption_sum_timeseries = merged_df.query('is_consumption == 1 & is_business == 0')[["datetime", "target"]].groupby("datetime").sum()
non_business_consumption_sum_timeseries.reset_index(inplace=True)
non_business_consumption_sum_timeseries = split_datetime(non_business_consumption_sum_timeseries, "datetime")
# Have a look at the consumption average per day of the week
non_business_consumption_av_day_of_week = non_business_consumption_sum_timeseries.groupby(["day_of_week", "hour"])[["target"]].mean()
non_business_consumption_av_day_of_week.reset_index(inplace=True)
plt.figure(figsize=(12,8))
plt.plot(business_consumption_av_day_of_week.index, business_consumption_av_day_of_week["target"], label="business")
plt.plot(non_business_consumption_av_day_of_week.index, non_business_consumption_av_day_of_week["target"], label="non-business")
plt.legend()
Timezone for datetime is None Timezone for datetime is None
<matplotlib.legend.Legend at 0x3e4a12950>
small_producers = merged_df.query('is_consumption == 0 & installed_capacity_client < 1000')
small_producers_sum_timeseries = small_producers[["datetime", "county", "target", "surface_solar_radiation_downwards_forecast_weather", "installed_capacity_client"]].groupby(["county", "datetime"]).agg({"target": "sum", "surface_solar_radiation_downwards_forecast_weather": "mean", "installed_capacity_client": "mean"})
small_producers_sum_timeseries.reset_index(inplace=True)
plt.figure(figsize=(12,8))
sns.scatterplot(data=small_producers_sum_timeseries.query("county == 0"), x="surface_solar_radiation_downwards_forecast_weather", y="target", hue="installed_capacity_client")
large_producers = merged_df.query('is_consumption == 0 & installed_capacity_client >= 1000')
large_producers_sum_timeseries = large_producers[["datetime", "county", "target", "surface_solar_radiation_downwards_forecast_weather", "installed_capacity_client"]].groupby(["county", "datetime"]).agg({"target": "sum", "surface_solar_radiation_downwards_forecast_weather": "mean", "installed_capacity_client": "mean"})
large_producers_sum_timeseries.reset_index(inplace=True)
plt.figure(figsize=(12,8))
sns.scatterplot(data=large_producers_sum_timeseries.query("county == 0"), x="surface_solar_radiation_downwards_forecast_weather", y="target", hue="installed_capacity_client")
<Axes: xlabel='surface_solar_radiation_downwards_forecast_weather', ylabel='target'>
The relation between solar radiation and production appears linear, but the slope depends on installed capacity (roughly two lines fanning out from the same origin, like a "star").
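One way to probe this "star" pattern: if production is roughly slope × radiation with the slope proportional to installed capacity, then dividing target by installed capacity should collapse the lines onto one. A toy sketch with invented numbers:

```python
import pandas as pd

# two hypothetical producers (capacities 100 and 300) under the same radiation
toy = pd.DataFrame({
    'surface_solar_radiation_downwards_forecast_weather': [100.0, 200.0, 100.0, 200.0],
    'installed_capacity_client': [100.0, 100.0, 300.0, 300.0],
    'target': [10.0, 20.0, 30.0, 60.0],
})
# normalize production by capacity: the two 'star' arms collapse onto one line
toy['target_per_capacity'] = toy['target'] / toy['installed_capacity_client']
print(toy['target_per_capacity'].tolist())  # [0.1, 0.2, 0.1, 0.2]
```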
px.line(merged_df.groupby(['county', 'datetime', 'is_consumption']).mean(numeric_only=True).reset_index(),
x='datetime', y='target', color='county', facet_col='is_consumption')
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# trace 0 = surface solar radat
# trace 1 = target
# trace 2 = sum of installed capacity
#df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).mean(numeric_only=True).reset_index()
aggregations = {'target': 'sum', 'surface_solar_radiation_downwards_forecast_weather': 'mean', 'installed_capacity_client': 'sum'}
try:
    df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).aggregate(aggregations).reset_index()
except Exception:
    print('column-wise aggregation failed, falling back to mean')
    df = merged_df.query('(is_business not in [1]) and (is_consumption == 0)').groupby(['datetime']).aggregate('mean').reset_index()
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
go.Scatter(x=df['datetime'], y=df['surface_solar_radiation_downwards_forecast_weather'], opacity=0.7),
secondary_y=True,
)
fig.add_trace(
go.Scatter(x=df['datetime'], y=df['target'], opacity=0.7),
secondary_y=False,
)
fig.add_trace(
go.Scatter(x=df['datetime'], y=df['installed_capacity_client'], opacity=0.7),
secondary_y=False,
)
# display the interactive figure
fig
merged_df.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | ... | cloudcover_low_forecast_weather | cloudcover_mid_forecast_weather | cloudcover_total_forecast_weather | 10_metre_u_wind_component_forecast_weather | 10_metre_v_wind_component_forecast_weather | direct_solar_radiation_forecast_weather | surface_solar_radiation_downwards_forecast_weather | snowfall_forecast_weather | total_precipitation_forecast_weather | datetime_EET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
5 rows × 55 columns
Hypotheses and baseline model ideas¶
- Simple Baseline Ideas
Simple baseline model: the (average) consumption on the same day one year earlier predicts the consumption for this year
How does the weather of the previous day influence production and consumption (if sunshine the day before is low, production the next day would possibly be lower)?
Very simple regression model with few features (solar radiation columns, snow/temperature, capacity)
- Correlation
- Which parameters correlate most strongly with the target (e.g. the higher the solar radiation, the higher the production; the higher the capacity, the higher the production)?
- Do gas prices and electricity prices change together -> correlation
- Two models: production (solar and capacity), consumption (temperature); is_consumption indicates which one a row belongs to
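The "very simple regression model" idea can be sketched as follows, on synthetic data standing in for a few features such as solar radiation, temperature and capacity (feature meanings and coefficients are invented for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))  # stand-ins for [radiation, temperature, capacity]
y = 5 * X[:, 0] + 2 * X[:, 2] + rng.normal(0, 0.1, size=500)  # production-like target

# fit on the first 400 rows, evaluate on the last 100
model = LinearRegression().fit(X[:400], y[:400])
pred = model.predict(X[400:])
print(round(mean_absolute_error(y[400:], pred), 3))
```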
Feature engineering¶
Add columns for date / time¶
def add_features(data):
    # What columns are of type datetime? (use data, not the global merged_df)
    datetime_columns = data.select_dtypes(include='datetime64').columns
    for c in datetime_columns:
        print(f"Timezone for {c} is {data[c].dt.tz}")
    # Adding columns for date & time
    data['year'] = data['datetime'].dt.year
    data['quarter'] = data['datetime'].dt.quarter
    data['month'] = data['datetime'].dt.month
    data['week'] = data['datetime'].dt.isocalendar().week
    data['hour'] = data['datetime'].dt.hour
    data['day_of_year'] = data['datetime'].dt.day_of_year
    data['day_of_month'] = data['datetime'].dt.day
    data['day_of_week'] = data['datetime'].dt.day_of_week
    return data
merged_df = add_features(merged_df)
## -> need to convert to EET timezone
Timezone for datetime is None Timezone for date_client is None Timezone for forecast_date_gas_prices is None Timezone for origin_date_gas_prices is None Timezone for forecast_date_electricity_prices is None Timezone for origin_date_electricity_prices is None Timezone for datetime_hist_weather is None
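The printout confirms that all datetime columns are timezone-naive. The EET conversion noted above could look like this, assuming the raw timestamps are UTC (if they are already local time, `tz_localize('Europe/Tallinn')` alone would be the right call — worth verifying against the data description):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2021-12-01 10:00:00']))
# localize the naive timestamps first, then convert to Estonian local time
ts_eet = ts.dt.tz_localize('UTC').dt.tz_convert('Europe/Tallinn')
print(ts_eet[0])  # 2021-12-01 12:00:00+02:00 (EET is UTC+2 in winter)
```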
Baseline model¶
model ideas
- calculate the average of one year and use it as the prediction for the next year? (but we already know that seasonality is important)
- **prediction for time t equals the value at t minus one year**
to do
- define X and y, only 'model' on the most recent year (2023?)
test = train.copy()
test.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2021-09-01 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 | 0 | 1 | 0 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2021-09-01 | 0 | 2 | 1 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 | 0 | 3 | 1 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2021-09-01 | 0 | 4 | 2 |
# create baseline model
test_offset = train.copy()
test_offset
test_offset['datetime'] = test_offset['datetime'] + pd.Timedelta(value=365, unit='days')  # PLUS: shift last year's rows forward so they align with this year's timestamps
test_offset.head()
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0.713 | 0 | 2022-09-01 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2022-09-01 | 0 | 1 | 0 |
| 2 | 0 | 0 | 2 | 0.000 | 0 | 2022-09-01 | 0 | 2 | 1 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2022-09-01 | 0 | 3 | 1 |
| 4 | 0 | 0 | 3 | 2.904 | 0 | 2022-09-01 | 0 | 4 | 2 |
# merge df with baseline predictions (in test_offset)
test = test.merge(
test_offset,
on=['county', 'is_business', 'product_type', 'is_consumption', 'datetime'],
how='left',
suffixes=('', '_previous_year')
)
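The direction of the shift matters: adding one year moves last year's rows forward so they line up with (and thus predict) this year's timestamps. Note also that a fixed 365-day offset drifts by one day whenever the shifted span contains a leap day; `pd.DateOffset(years=1)` preserves the calendar date:

```python
import pandas as pd

d = pd.Timestamp('2020-02-28')     # 2020 is a leap year
print(d + pd.Timedelta(days=365))  # lands on 2021-02-27, drifted past the leap day
print(d + pd.DateOffset(years=1))  # lands on 2021-02-28, same calendar date
```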
# idea: define function to pull target from 1 year ago
'''def baseline(date):
prediction = target from one year ago
return prediction'''
'def baseline(date):\n prediction = target from one year ago \n\n return prediction'
from sklearn.metrics import mean_absolute_error
test.dropna(inplace=True)
print('Previous year as prediction:', mean_absolute_error(test['target'], test['target_previous_year']))
Previous year as prediction: 134.51285907725816
# define error function
def calc_mae(x, y):
    mae = np.mean(np.abs(y - x))
    return mae
# generate df for plotting
plotdat = test.groupby(['datetime', 'is_consumption']).mean().reset_index()
print('MAE consumption: ', calc_mae(plotdat.query('is_consumption == 1')['target'], plotdat.query('is_consumption == 1')['target_previous_year']))
print('MAE production: ', calc_mae(plotdat.query('is_consumption == 0')['target'], plotdat.query('is_consumption == 0')['target_previous_year']))
print('MAE all: ', calc_mae(plotdat['target'], plotdat['target_previous_year']))
MAE consumption: 120.16727685427284 MAE production: 55.55173923700824 MAE all: 87.85950804564054
The pre-implemented sklearn MAE differs from the calc_mae numbers above. This is expected: sklearn's MAE was computed over every individual row, while calc_mae was applied to plotdat, which first averages per timestamp, so errors partly cancel before the absolute value is taken. Kaggle uses the row-wise metric -> this is what we should optimize for.
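A tiny numerical demonstration (invented numbers) that MAE over individual rows generally differs from MAE over per-group averages:

```python
import numpy as np

y_true = np.array([0.0, 10.0, 0.0, 10.0])
y_pred = np.array([10.0, 0.0, 0.0, 10.0])
groups = np.array([0, 0, 1, 1])

# MAE over individual rows
row_mae = np.mean(np.abs(y_true - y_pred))  # 5.0

# averaging within each group first: both group means agree, errors cancel
grp_true = np.array([y_true[groups == g].mean() for g in (0, 1)])
grp_pred = np.array([y_pred[groups == g].mean() for g in (0, 1)])
grouped_mae = np.mean(np.abs(grp_true - grp_pred))  # 0.0

print(row_mae, grouped_mae)
```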
fig = make_subplots(specs=[[{"secondary_y": True}]])
test_prod = test.query('is_consumption == 0')
fig.add_trace(
go.Scatter(x=test_prod['datetime'], y=test_prod['target'], opacity=0.7, name='production_true'),
secondary_y=False,
)
fig.add_trace(
go.Scatter(x=test_prod['datetime'], y=test_prod['target_previous_year'], opacity=0.7, name='production_pred'),
secondary_y=False,
)
# display the interactive figure
fig
# plot actual and baseline-predicted consumption
plt.plot(plotdat.query('is_consumption == 1')['datetime'], plotdat.query('is_consumption == 1')['target'])
plt.plot(plotdat.query('is_consumption == 1')['datetime'], plotdat.query('is_consumption == 1')['target_previous_year'], 'r')
plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted consumption')
Text(0.5, 1.0, 'Baseline model: observed and predicted consumption')
# plot actual and baseline-predicted production
plt.plot(plotdat.query('is_consumption == 0')['datetime'], plotdat.query('is_consumption == 0')['target'])
plt.plot(plotdat.query('is_consumption == 0')['datetime'], plotdat.query('is_consumption == 0')['target_previous_year'], 'r')
plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted production')
Text(0.5, 1.0, 'Baseline model: observed and predicted production')
# plot actual and baseline-predicted production
# average over data_block_id, because production is 0 at night (no sun), which makes the hourly plot unreadable
plotdat2 = plotdat.groupby(['data_block_id', 'is_consumption']).mean().reset_index()
plt.plot(plotdat2.query('is_consumption == 0')['datetime'], plotdat2.query('is_consumption == 0')['target'])
plt.plot(plotdat2.query('is_consumption == 0')['datetime'], plotdat2.query('is_consumption == 0')['target_previous_year'], 'r')
plt.legend(['obs.', 'pred.'])
plt.title('Baseline model: observed and predicted production')
Text(0.5, 1.0, 'Baseline model: observed and predicted production')
Modelling¶
# free memory: drop the raw source frames, which are already merged into merged_df
del train
del client
del gas_prices
del electricity_prices
del forecast_weather
del historical_weather
# copy df for modelling
model_df = merged_df.copy()
# XGBoost cannot handle object-dtype columns
model_df.drop('time_of_day', axis=1, inplace=True)
# split datetime into meaningful integer features
model_df = split_datetime(model_df)
# XGBoost cannot handle raw datetime columns either
model_df = model_df.drop(model_df.select_dtypes(include=['datetime64[ns]', 'datetime64[ns, EET]']).columns, axis=1)
# drop rows with missing target
model_df.dropna(subset=['target'], inplace=True)
Timezone for datetime is None Timezone for date_client is None Timezone for forecast_date_gas_prices is None Timezone for origin_date_gas_prices is None Timezone for forecast_date_electricity_prices is None Timezone for origin_date_electricity_prices is None Timezone for datetime_hist_weather is None
import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(model_df.drop('target', axis=1), model_df['target'], test_size=0.3, random_state=0)
bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)
y_pred = bst.predict(X_test)
# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred))
print('Mean absolute error train', mean_absolute_error(y_train, bst.predict(X_train)))
Mean absolute error test 50.735112752114915 Mean absolute error train 49.32964331672551
# first attempt gave us ~50.74 mean absolute error
model_df['data_block_id'].describe()
count 2.017824e+06 mean 3.218754e+02 std 1.826310e+02 min 0.000000e+00 25% 1.660000e+02 50% 3.230000e+02 75% 4.790000e+02 max 6.370000e+02 Name: data_block_id, dtype: float64
# time-based split: older data for training, newer data for testing
Xy_train = model_df[model_df.data_block_id < 450]
X_train = Xy_train.drop('target', axis=1)
y_train = Xy_train.target
Xy_test = model_df[model_df.data_block_id >= 450]
X_test = Xy_test.drop('target', axis=1)
y_test = Xy_test.target
bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)
y_pred = bst.predict(X_test)
# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred))
print('Mean absolute error train', mean_absolute_error(y_train, bst.predict(X_train)))
Mean absolute error test 109.3723796305108 Mean absolute error train 44.359425148907334
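The jump from ~44 train MAE to ~109 test MAE under the time-based split suggests evaluating with a time-aware scheme throughout. A minimal sketch of sklearn's `TimeSeriesSplit`, which always validates on data later than its training fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # stand-in for chronologically ordered rows
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # every validation index lies strictly after the last training index
    print(train_idx.max() < test_idx.min(), len(train_idx), len(test_idx))
```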
# divide by dates and use newer ones for validation
bst.feature_names_in_
array(['county', 'is_business', 'product_type', 'is_consumption',
'data_block_id', 'row_id', 'prediction_unit_id',
'eic_count_client', 'installed_capacity_client',
'lowest_price_per_mwh_gas_prices',
'highest_price_per_mwh_gas_prices',
'euros_per_mwh_electricity_prices', 'temperature_hist_weather',
'dewpoint_hist_weather', 'rain_hist_weather',
'snowfall_hist_weather', 'surface_pressure_hist_weather',
'cloudcover_total_hist_weather', 'cloudcover_low_hist_weather',
'cloudcover_mid_hist_weather', 'cloudcover_high_hist_weather',
'windspeed_10m_hist_weather', 'winddirection_10m_hist_weather',
'shortwave_radiation_hist_weather',
'direct_solar_radiation_hist_weather',
'diffuse_radiation_hist_weather', 'latitude_hist_weather',
'longitude_hist_weather', 'latitude_forecast_weather',
'longitude_forecast_weather', 'hours_ahead_forecast_weather',
'temperature_forecast_weather', 'dewpoint_forecast_weather',
'cloudcover_high_forecast_weather',
'cloudcover_low_forecast_weather',
'cloudcover_mid_forecast_weather',
'cloudcover_total_forecast_weather',
'10_metre_u_wind_component_forecast_weather',
'10_metre_v_wind_component_forecast_weather',
'direct_solar_radiation_forecast_weather',
'surface_solar_radiation_downwards_forecast_weather',
'snowfall_forecast_weather',
'total_precipitation_forecast_weather', 'year', 'quarter', 'month',
'week', 'hour', 'day_of_year', 'day_of_month', 'day_of_week'],
dtype='<U50')
bst.feature_importances_
array([0.02317161, 0.19814272, 0.0029233 , 0.16164385, 0.00755489,
0.01130688, 0.0255705 , 0.01065247, 0.20469157, 0.00137255,
0.00233338, 0.0032099 , 0.00840222, 0.00161291, 0.00115936,
0.00040205, 0.00126914, 0.00026398, 0.00053658, 0.00101611,
0.00041738, 0.00092538, 0.00074864, 0.00248047, 0.00115231,
0.00131035, 0.01265862, 0.00617154, 0.04587046, 0.00263224,
0.01358437, 0.02500847, 0.00102796, 0.0005635 , 0.01480848,
0.00196344, 0.00156044, 0.00141749, 0.00141629, 0.00660226,
0.06700346, 0.03550554, 0.00196549, 0. , 0.00048924,
0.00194408, 0.0076613 , 0.02466489, 0.01165206, 0.00241098,
0.0371469 ], dtype=float32)
features_dict = {key: value for key, value in zip(bst.feature_names_in_, bst.feature_importances_)}
features_dict
{'county': 0.02317161,
'is_business': 0.19814272,
'product_type': 0.002923302,
'is_consumption': 0.16164385,
'data_block_id': 0.0075548883,
'row_id': 0.011306881,
'prediction_unit_id': 0.0255705,
'eic_count_client': 0.0106524695,
'installed_capacity_client': 0.20469157,
'lowest_price_per_mwh_gas_prices': 0.0013725488,
'highest_price_per_mwh_gas_prices': 0.0023333782,
'euros_per_mwh_electricity_prices': 0.0032099008,
'temperature_hist_weather': 0.008402225,
'dewpoint_hist_weather': 0.0016129076,
'rain_hist_weather': 0.0011593648,
'snowfall_hist_weather': 0.00040205359,
'surface_pressure_hist_weather': 0.0012691438,
'cloudcover_total_hist_weather': 0.0002639827,
'cloudcover_low_hist_weather': 0.0005365753,
'cloudcover_mid_hist_weather': 0.0010161146,
'cloudcover_high_hist_weather': 0.00041738356,
'windspeed_10m_hist_weather': 0.0009253759,
'winddirection_10m_hist_weather': 0.0007486421,
'shortwave_radiation_hist_weather': 0.0024804687,
'direct_solar_radiation_hist_weather': 0.001152306,
'diffuse_radiation_hist_weather': 0.0013103458,
'latitude_hist_weather': 0.012658624,
'longitude_hist_weather': 0.00617154,
'latitude_forecast_weather': 0.04587046,
'longitude_forecast_weather': 0.0026322408,
'hours_ahead_forecast_weather': 0.013584371,
'temperature_forecast_weather': 0.025008474,
'dewpoint_forecast_weather': 0.0010279635,
'cloudcover_high_forecast_weather': 0.000563501,
'cloudcover_low_forecast_weather': 0.014808475,
'cloudcover_mid_forecast_weather': 0.0019634373,
'cloudcover_total_forecast_weather': 0.0015604437,
'10_metre_u_wind_component_forecast_weather': 0.0014174874,
'10_metre_v_wind_component_forecast_weather': 0.001416292,
'direct_solar_radiation_forecast_weather': 0.0066022635,
'surface_solar_radiation_downwards_forecast_weather': 0.06700346,
'snowfall_forecast_weather': 0.035505537,
'total_precipitation_forecast_weather': 0.0019654892,
'year': 0.0,
'quarter': 0.0004892406,
'month': 0.0019440779,
'week': 0.0076612984,
'hour': 0.02466489,
'day_of_year': 0.011652055,
'day_of_month': 0.002410979,
'day_of_week': 0.037146896}
xgb.plot_importance(bst)
plt.title('Feature Importance')
plt.show()
# hours_ahead_forecast_weather shows up as an important feature, probably something to drop
- visualisation
- split by date
- tweaking the parameters
- drop some features
- feature engineering
- overfitting with traditional train_test_split?
- try two models / multi-output / other models
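For the "drop some features" item: identifier-like columns (row_id, data_block_id, prediction_unit_id) carry no physical signal and let the model memorize position in the dataset. A sketch of a small helper — the exact drop list below is an assumption to be validated against CV scores, not a verified recipe:

```python
import pandas as pd

# candidate columns to drop before refitting -- an assumption, not a verified list
drop_candidates = ['row_id', 'data_block_id', 'prediction_unit_id',
                   'hours_ahead_forecast_weather']

def drop_present(df, cols):
    """Drop only those columns that actually exist in df."""
    return df.drop(columns=[c for c in cols if c in df.columns])

# toy frame standing in for model_df
toy = pd.DataFrame({'row_id': [1, 2], 'target': [0.5, 0.7]})
print(list(drop_present(toy, drop_candidates).columns))  # ['target']
```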
# model_df2 = model_df.copy()
# model_df2.drop(['row_id', ])
split_datablock = 300
Xy_train = model_df[model_df.data_block_id < split_datablock]
X_train = Xy_train.drop('target', axis=1)
y_train = Xy_train.target
Xy_test = model_df[model_df.data_block_id >= split_datablock]
X_test = Xy_test.drop('target', axis=1)
y_test = Xy_test.target
bst = XGBRegressor(enable_categorical=True)
bst.fit(X_train, y_train)
y_pred_test = bst.predict(X_test)
y_pred_train = bst.predict(X_train)
# main optimisation metric
print('Mean absolute error test', mean_absolute_error(y_test, y_pred_test))
print('Mean absolute error train', mean_absolute_error(y_train, y_pred_train))
Mean absolute error test 143.24422275958875 Mean absolute error train 39.44093814710219
#px.line(x=Xy_train.index, y=y_pred_train-y_train, color=Xy_train.month)
#px.line(data_frame=Xy_test, x=Xy_test.index, y=y_pred_test-y_test, color=Xy_test.month, hover_data='day_of_week')
Plot y_train vs. y_pred¶
trace1 = go.Scatter(x=Xy_train.index, y=y_train, mode='lines', name='y_train', fill='none')
trace2 = go.Scatter(x=Xy_train.index, y=y_pred_train, mode='lines', name='y_pred_train', fill='none')
layout = go.Layout(title='Comparison of y_train and y_pred_train', xaxis=dict(title='Index'), yaxis=dict(title='Values'))
fig = go.Figure(data=[trace1, trace2], layout=layout)
# Show the plot
fig.show()
Target consumption vs. production¶
merged_cons = merged_df[merged_df['is_consumption'] == 1].copy()
merged_prod = merged_df[merged_df['is_consumption'] == 0].copy()
plt.plot(merged_cons.index, merged_cons['target'], label='target consumption', linestyle='-', marker='o', color='red')
plt.plot(merged_prod.index, merged_prod['target'], label='target production', linestyle='-', marker='o', color='green')
# Set labels and title
plt.xlabel('Index')
plt.ylabel('Values')
plt.title('Comparison of target consumption vs. production')
# Add legend
plt.legend()
# Show the plot
plt.show()
# Create a line plot for consumption
plt.figure(figsize=(12, 6)) # Set the figure size
plt.subplot(1, 2, 1) # Create the first subplot
for month in merged_cons['month'].unique():
    data = merged_cons[merged_cons['month'] == month]
    plt.plot(data.index, data['target'], label=f'Month {month}')
plt.xlabel('Index')
plt.ylabel('Target')
plt.title('Target Consumption by Month')
plt.legend()
# Create a line plot for production
plt.subplot(1, 2, 2) # Create the second subplot
for month in merged_prod['month'].unique():
    data = merged_prod[merged_prod['month'] == month]
    plt.plot(data.index, data['target'], label=f'Month {month}')
plt.xlabel('Index')
plt.ylabel('Target')
plt.title('Target Production by Month')
plt.legend()
# Adjust layout for better spacing
plt.tight_layout()
# Show the plots
plt.show()
merged_cons
| county | is_business | product_type | target | is_consumption | datetime | data_block_id | row_id | prediction_unit_id | eic_count_client | ... | total_precipitation_forecast_weather | datetime_EET | year | quarter | month | week | hour | day_of_year | day_of_month | day_of_week | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 96.590 | 1 | 2021-09-01 00:00:00 | 0 | 1 | 0 | NaN | ... | NaN | NaT | 2021 | 3 | 9 | 35 | 0 | 244 | 1 | 2 |
| 3 | 0 | 0 | 2 | 17.314 | 1 | 2021-09-01 00:00:00 | 0 | 3 | 1 | NaN | ... | NaN | NaT | 2021 | 3 | 9 | 35 | 0 | 244 | 1 | 2 |
| 5 | 0 | 0 | 3 | 656.859 | 1 | 2021-09-01 00:00:00 | 0 | 5 | 2 | NaN | ... | NaN | NaT | 2021 | 3 | 9 | 35 | 0 | 244 | 1 | 2 |
| 7 | 0 | 1 | 0 | 59.000 | 1 | 2021-09-01 00:00:00 | 0 | 7 | 3 | NaN | ... | NaN | NaT | 2021 | 3 | 9 | 35 | 0 | 244 | 1 | 2 |
| 9 | 0 | 1 | 1 | 501.760 | 1 | 2021-09-01 00:00:00 | 0 | 9 | 4 | NaN | ... | NaN | NaT | 2021 | 3 | 9 | 35 | 0 | 244 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018343 | 15 | 0 | 1 | 42.401 | 1 | 2023-05-31 23:00:00 | 637 | 2018343 | 57 | 51.0 | ... | 0.0 | 2023-05-31 23:00:00+03:00 | 2023 | 2 | 5 | 22 | 23 | 151 | 31 | 2 |
| 2018345 | 15 | 0 | 3 | 117.332 | 1 | 2023-05-31 23:00:00 | 637 | 2018345 | 58 | 161.0 | ... | 0.0 | 2023-05-31 23:00:00+03:00 | 2023 | 2 | 5 | 22 | 23 | 151 | 31 | 2 |
| 2018347 | 15 | 1 | 0 | 197.233 | 1 | 2023-05-31 23:00:00 | 637 | 2018347 | 64 | 15.0 | ... | 0.0 | 2023-05-31 23:00:00+03:00 | 2023 | 2 | 5 | 22 | 23 | 151 | 31 | 2 |
| 2018349 | 15 | 1 | 1 | 28.404 | 1 | 2023-05-31 23:00:00 | 637 | 2018349 | 59 | 20.0 | ... | 0.0 | 2023-05-31 23:00:00+03:00 | 2023 | 2 | 5 | 22 | 23 | 151 | 31 | 2 |
| 2018351 | 15 | 1 | 3 | 196.240 | 1 | 2023-05-31 23:00:00 | 637 | 2018351 | 60 | 55.0 | ... | 0.0 | 2023-05-31 23:00:00+03:00 | 2023 | 2 | 5 | 22 | 23 | 151 | 31 | 2 |
1009176 rows × 63 columns
# Set y-axis limits based on the combined consumption and production data
y_min = min(merged_cons['target'].min(), merged_prod['target'].min())
y_max = max(merged_cons['target'].max(), merged_prod['target'].max())
# Create a line plot for consumption
plt.figure(figsize=(10, 5)) # Set the figure size
plt.plot(merged_cons['week'], merged_cons['target'], label='Consumption', linestyle='-', marker='o')
plt.xlabel('Week')
plt.ylabel('Target')
plt.title('Target Consumption by Week')
plt.ylim(y_min, y_max) # Set y-axis limits
plt.legend()
# Create a line plot for production
plt.figure(figsize=(10, 5)) # Set the figure size
plt.plot(merged_prod['week'], merged_prod['target'], label='Production', linestyle='-', marker='o')
plt.xlabel('Week')
plt.ylabel('Target')
plt.title('Target Production by Week')
plt.ylim(y_min, y_max) # Set y-axis limits
plt.legend()
# Show the plots
plt.show()
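The weekly plots above draw every raw observation against its week number, which overplots heavily. Aggregating the target per week first gives a cleaner seasonal picture. A minimal sketch with synthetic data standing in for `merged_cons`:

```python
# Sketch: aggregate the target per week before plotting, instead of drawing
# every raw row. The 'demo' frame is synthetic stand-in data.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so this also runs outside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'week': rng.integers(1, 53, 1000),          # ISO-style week numbers 1..52
    'target': rng.gamma(2.0, 50.0, 1000),       # positive, skewed like consumption
})

# One mean value per week instead of thousands of raw points
weekly_mean = demo.groupby('week')['target'].mean()

plt.figure(figsize=(10, 5))
plt.plot(weekly_mean.index, weekly_mean.values, marker='o')
plt.xlabel('Week')
plt.ylabel('Mean target')
plt.title('Mean Target by Week (aggregated)')
plt.close()
```

The same `groupby('week')['target'].mean()` pattern applies directly to `merged_cons` and `merged_prod`.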
target_column = 'residual'
# Exclude non-numeric columns
numeric_columns = Xy_test.select_dtypes(include=['number']).columns
numeric_df = Xy_test[numeric_columns]
numeric_df_cons = numeric_df[numeric_df['is_consumption'] == 1]
# Calculate the correlation matrix
correlation_matrix = numeric_df_cons.corr()
# Select correlations based on the threshold
threshold = 0.15
significant_correlations = correlation_matrix[(correlation_matrix[target_column] > threshold) | (correlation_matrix[target_column] < -threshold)][target_column]
# Plot a heatmap of the significant correlations with the target
plt.figure(figsize=(12, 8))
sns.heatmap(significant_correlations.to_frame(), annot=True, cmap='coolwarm', fmt=".2f", cbar=False)
plt.title(f'Significant Correlations with {target_column}, CONSUM ONLY (Threshold: {threshold})')
plt.show()
target_column = 'residual'
# Exclude non-numeric columns
numeric_columns = Xy_test.select_dtypes(include=['number']).columns
numeric_df = Xy_test[numeric_columns]
numeric_df_prod = numeric_df[numeric_df['is_consumption'] == 0]
# Calculate the correlation matrix
correlation_matrix = numeric_df_prod.corr()
# Select correlations based on the threshold
threshold = 0.15
significant_correlations = correlation_matrix[(correlation_matrix[target_column] > threshold) | (correlation_matrix[target_column] < -threshold)][target_column]
# Plot a heatmap of the significant correlations with the target
plt.figure(figsize=(12, 8))
sns.heatmap(significant_correlations.to_frame(), annot=True, cmap='coolwarm', fmt=".2f", cbar=False)
plt.title(f'Significant Correlations with {target_column}, PRODUCTION ONLY (Threshold: {threshold})')
plt.show()
- residuals are bigger in summer, we guess because production happens in that period
- residuals on the test data show a weekly pattern
- the last two months are predicted very poorly
- residuals differ depending on how we split our data
- we see unexplainable patterns in the residuals
- residuals for consumption and production correlate with different features

Next Steps¶
- try residual analysis with a traditional train_test_split
- tweak the model
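The proposed next step, residual analysis with a random `train_test_split` instead of the chronological split used above, can be sketched as follows. The data and model here are synthetic placeholders (the notebook would use the numeric columns of `merged_df` and its actual model); the weekly sine term just mimics the weekly residual pattern noted above.

```python
# Sketch of residual analysis with a random (non-chronological) split.
# Synthetic data; feature names loosely mirror the notebook's columns.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    'temperature': rng.normal(10, 5, n),
    'installed_capacity': rng.uniform(100, 5000, n),
    'hour': rng.integers(0, 24, n),
    'day_of_week': rng.integers(0, 7, n),
})
# Synthetic target with a built-in weekly pattern
df['target'] = (
    0.5 * df['installed_capacity'] / 100
    + 3 * np.sin(df['day_of_week'])
    + rng.normal(0, 1, n)
)

X = df.drop(columns='target')
y = df['target']
# Random split, unlike the time-based split used earlier in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

residuals = y_test - model.predict(X_test)
# Check whether the weekly residual pattern survives a random split
weekly_residuals = residuals.groupby(X_test['day_of_week']).mean()
```

If `weekly_residuals` stays close to zero here but not under the chronological split, that would point at a time-dependent effect (drift or leakage) rather than a missing weekly feature.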